1 Introduction
Deep learning frameworks play a crucial role in building robust deep learning systems (Zhang et al., 2020). With the rapid advancement of deep learning technology, the demand for deep learning frameworks has grown exponentially (Guo et al., 2018). This expansion encompasses the incorporation of new interfaces, the enhancement of functionalities, and the optimization of compatibility with a wide array of hardware devices and underlying drivers. Throughout this evolutionary process, continuous code iteration and version updates inevitably introduce bugs into deep learning frameworks (Zhang et al., 2018). Compared with bugs in specific deep learning models, bugs in deep learning frameworks have a far wider-reaching impact because they affect every user of the framework. Particularly in safety- and security-critical domains such as autonomous driving (Chen et al., 2015) and healthcare (Cai et al., 2014), the consequences of these bugs can be severe. Therefore, ensuring the reliability of deep learning frameworks is of utmost importance.
Numerous studies have been conducted to gain insights into the characteristics of bugs in deep learning frameworks and to assist in their resolution. For instance, Jia et al. (2021) analyzed bugs in TensorFlow based on 202 bug fixes and found that these bugs can be classified into 6 distinct categories by symptom and 11 distinct categories by root cause. Islam et al. (2019) examined five deep learning libraries, namely Caffe (Jia et al., 2014), Keras (Lux & Bertini, 2019), TensorFlow (Girija, 2016), Theano (Team et al., 2016), and Torch (Collobert et al., 2002). They analyzed 2,716 posts from Stack Overflow and 500 bug-fix commits from GitHub to identify commonly occurring bug types in deep learning frameworks, yielding five bug types: API bugs, Coding bugs, Data bugs, Structural bugs, and Non-model structural bugs. In Du et al. (2022), we classified bug reports in TensorFlow, MXNet, and PaddlePaddle according to their fault-triggering conditions. Taking into account the conditions of fault activation and error propagation, bugs were categorized into Bohrbugs (BOHs) and Mandelbugs (MANs); within the MAN category, bugs were further classified as either non-aging-related Mandelbugs (NAMs) or aging-related bugs (ARBs).
However, the bug classification in all of the aforementioned studies was performed manually. As the number of bug reports in deep learning frameworks continues to grow, manually classifying every report becomes impractical, making automated bug report classification methods essential. Xia et al. (2014) represented bug reports with the bag-of-words model and classified them using machine learning classifiers. However, the bag-of-words model neglects the contextual semantic information present in bug reports, resulting in inadequate classification results.
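To make this bag-of-words baseline concrete, the sketch below shows one plausible way such a pipeline could be assembled with scikit-learn; the report texts, labels, and choice of classifier are purely illustrative and do not reproduce the exact setup of Xia et al. (2014).

```python
# Illustrative bag-of-words bug report classification pipeline
# (a minimal sketch; report texts and labels are made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy bug reports with fault-trigger labels (BOH = Bohrbug, MAN = Mandelbug).
reports = [
    "Segfault immediately on every call to conv2d with negative padding",
    "Memory usage grows slowly until the training job crashes after hours",
    "TypeError raised deterministically when dtype is int8",
    "Intermittent deadlock when two sessions share one GPU stream",
]
labels = ["BOH", "MAN", "BOH", "MAN"]

# Bag-of-words: each report becomes a sparse vector of token counts,
# discarding word order and hence contextual semantics.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(reports, labels)

print(clf.predict(["Crash occurs only after the process has run for days"]))
```

Because the vectorizer counts tokens independently, reports that express the same fault with different word order or phrasing receive unrelated representations, which is precisely the limitation noted above.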
To address this limitation and effectively exploit the semantic information embedded within bug reports, we proposed the DeepSIM method in Du et al. (2021). DeepSIM employed a word2vec semantic model trained on over two million bug reports, as sketched below.
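The sketch below illustrates, under our assumptions, how a word2vec model can embed bug report tokens and pool them into a report-level vector for a downstream classifier; the toy corpus, hyperparameters, and mean-pooling helper are illustrative and do not reproduce DeepSIM itself.

```python
# Illustrative word2vec-based report representation (a minimal sketch,
# not DeepSIM; corpus, hyperparameters, and pooling are assumptions).
import numpy as np
from gensim.models import Word2Vec

# A toy corpus of tokenized bug reports; DeepSIM trained on >2M reports.
corpus = [
    "segfault on conv2d with negative padding".split(),
    "memory usage grows until the job crashes".split(),
    "intermittent deadlock when sessions share a gpu".split(),
]

# Train a small skip-gram word2vec model on the tokenized reports.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=3,
               min_count=1, sg=1, epochs=20, seed=42)

def report_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens (mean pooling)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Report-level embeddings can then feed any standard classifier.
embedding = report_vector("gpu deadlock crashes the job".split(), w2v)
print(embedding.shape)  # (50,)
```

Unlike bag-of-words counts, such embeddings place semantically related tokens near one another, but their quality depends heavily on the size of the training corpus.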
However, the effectiveness of DeepSIM is hindered by the constrained size of the training corpus used for the semantic model. To address the aforementioned issues, we propose a Large Language Model-based Bug Report Classification framework (LLM-BRC) for deep learning frameworks. Large language models (LLMs), particularly GPT-3 and GPT-4 (Brown et al., 2020; Radford et al., 2018, 2019), have proven transformative in numerous fields and have made remarkable contributions in domains ranging from mathematics (Frieder et al., 2023) and communication (Guo et al., 2023) to even medicine (Nov et al., 2023). In particular, the prowess of LLMs lies in their ability to revolutionize