1 Introduction
Deep learning frameworks play a crucial role in building robust deep learning systems (Zhang et al., 2020). With the rapid advancement of deep learning technology, the demand for deep learning frameworks has grown exponentially (Guo et al., 2018). This expansion encompasses the incorporation of new interfaces, the enhancement of functionalities, and the optimization of compatibility with a wide array of hardware devices and underlying drivers. Throughout this evolutionary process, continuous code iteration and version updates inevitably introduce bugs into deep learning frameworks (Zhang et al., 2018). Compared with bugs in specific deep learning models, bugs in deep learning frameworks have a far wider-reaching impact because they affect every user of the framework. Particularly in safety- and security-critical domains such as autonomous driving (Chen et al., 2015) and healthcare (Cai et al., 2014), the consequences of these bugs can be severe. Therefore, ensuring the reliability of deep learning frameworks is of utmost importance.
Numerous studies have been conducted to gain insights into the characteristics of bugs in deep learning frameworks and to assist in their resolution. For instance, Jia et al. (2021) analyzed bugs in TensorFlow based on 202 bug fixes and found that these bugs can be classified into 6 distinct categories by symptom and 11 distinct categories by root cause. Islam et al. (2019) examined five deep learning libraries, namely Caffe (Jia et al., 2014), Keras (Lux & Bertini, 2019), TensorFlow (Girija, 2016), Theano (Team et al., 2016), and Torch (Collobert et al., 2002). They analyzed 2,716 posts from Stack Overflow and 500 bug-fix commits from GitHub to identify commonly occurring bug types in deep learning frameworks, yielding five bug types: API bugs, Coding bugs, Data bugs, Structural bugs, and Non-model structural bugs. In Du et al. (2022), we classified bug reports in TensorFlow, MXNet, and PaddlePaddle according to their fault-triggering conditions. Taking into account the conditions of fault activation and error propagation, bugs were categorized into Bohrbugs (BOHs) and Mandelbugs (MANs); within the MAN category, bugs were further classified as either non-aging-related Mandelbugs (NAMs) or aging-related bugs (ARBs).
However, the bug classification in all of the aforementioned studies was performed manually. As the number of bug reports in deep learning frameworks continues to grow, manually classifying every report becomes impractical, making automated bug report classification methods essential. Xia et al. (2014) represented bug reports with the bag-of-words model and classified them using machine learning classifiers. However, the bag-of-words model neglects the contextual semantic information present in bug reports, resulting in inadequate classification results.
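To make this bag-of-words baseline concrete, the sketch below shows one plausible way such a pipeline could be assembled with scikit-learn; the report texts, labels, and choice of classifier are purely illustrative and do not reproduce the exact setup of Xia et al. (2014).

```python
# Illustrative bag-of-words bug report classification pipeline
# (a minimal sketch; report texts and labels are made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy bug reports with fault-trigger labels (BOH = Bohrbug, MAN = Mandelbug).
reports = [
    "Segfault immediately on every call to conv2d with negative padding",
    "Memory usage grows slowly until the training job crashes after hours",
    "TypeError raised deterministically when dtype is int8",
    "Intermittent deadlock when two sessions share one GPU stream",
]
labels = ["BOH", "MAN", "BOH", "MAN"]

# Bag-of-words: each report becomes a sparse vector of token counts,
# discarding word order and hence contextual semantics.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(reports, labels)

print(clf.predict(["Crash occurs only after the process has run for days"]))
```

Because the vectorizer counts tokens independently, reports that express the same fault with different word order or phrasing receive unrelated representations, which is precisely the limitation noted above.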
To address this limitation and effectively exploit the semantic information embedded within bug reports, we proposed the DeepSIM method in Du et al. (2021). DeepSIM employed a word2vec semantic model trained on over two million bug reports, as sketched below.
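The sketch below illustrates, under our assumptions, how a word2vec model can embed bug report tokens and pool them into a report-level vector for a downstream classifier; the toy corpus, hyperparameters, and mean-pooling helper are illustrative and do not reproduce DeepSIM itself.

```python
# Illustrative word2vec-based report representation (a minimal sketch,
# not DeepSIM; corpus, hyperparameters, and pooling are assumptions).
import numpy as np
from gensim.models import Word2Vec

# A toy corpus of tokenized bug reports; DeepSIM trained on >2M reports.
corpus = [
    "segfault on conv2d with negative padding".split(),
    "memory usage grows until the job crashes".split(),
    "intermittent deadlock when sessions share a gpu".split(),
]

# Train a small skip-gram word2vec model on the tokenized reports.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=3,
               min_count=1, sg=1, epochs=20, seed=42)

def report_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens (mean pooling)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Report-level embeddings can then feed any standard classifier.
embedding = report_vector("gpu deadlock crashes the job".split(), w2v)
print(embedding.shape)  # (50,)
```

Unlike bag-of-words counts, such embeddings place semantically related tokens near one another, but their quality depends heavily on the size of the training corpus.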
However, the effectiveness of DeepSIM is hindered by the constrained size of the training corpus used for the semantic model. To address the aforementioned issues, we propose a Large Language Model-based Bug Report Classification framework (LLM-BRC) for deep learning frameworks. Large language models (LLMs), particularly GPT-3 and GPT-4 (Brown et al., 2020; Radford et al., 2018, 2019), have proven transformative in numerous fields and have made remarkable contributions in domains ranging from mathematics (Frieder et al., 2023) and communication (Guo et al., 2023) to even medicine (Nov et al., 2023). In particular, the prowess of LLMs lies in their ability to revolutionize