Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

IF 6.5 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING | IEEE Transactions on Software Engineering | Pub Date: 2024-09-04 | DOI: 10.1109/TSE.2024.3450837
Sungmin Kang;Juyeon Yoon;Nargiz Askarbekkyzy;Shin Yoo
{"title":"Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction","authors":"Sungmin Kang;Juyeon Yoon;Nargiz Askarbekkyzy;Shin Yoo","doi":"10.1109/TSE.2024.3450837","DOIUrl":null,"url":null,"abstract":"Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and via a post-processing pipeline to automatically identify promising generated tests, our proposed technique \n<sc>Libro</small>\n could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation on 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of performance on a held-out bug dataset likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using \n<sc>Libro</small>\n improves as LLM size increases, providing information as to which LLMs can be used with the \n<sc>Libro</small>\n pipeline.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 10","pages":"2677-2694"},"PeriodicalIF":6.5000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10664637/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often written in natural language and thus can be difficult to transform into test cases consistently. As a result, existing techniques have mostly focused on crash bugs, which are easier to automatically detect and verify. In this work, we overcome this limitation by using large language models (LLMs), which have been demonstrated to be adept at natural language processing and code generation. By prompting LLMs to generate bug-reproducing tests, and by using a post-processing pipeline to automatically identify promising generated tests, our proposed technique Libro can successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark. Furthermore, our extensive evaluation of 15 LLMs, including 11 open-source LLMs, suggests that open-source LLMs also demonstrate substantial potential, with the StarCoder LLM achieving 70% of the reproduction performance of the closed-source OpenAI LLM code-davinci-002 on the large Defects4J benchmark, and 90% of its performance on a held-out bug dataset that is likely not part of any LLM's training data. In addition, our experiments on LLMs of different sizes show that bug reproduction using Libro improves as LLM size increases, providing guidance on which LLMs can be used with the Libro pipeline.
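To make the "prompt, then post-process" pipeline described in the abstract more concrete, below is a minimal Python sketch of the idea: sample several candidate tests from an LLM prompted with the bug report, keep those that compile and fail on the buggy version of the program (i.e. they appear to exhibit the reported behavior), and rank the survivors by how many near-identical generations agree on them. The prompt wording and the callables query_llm, compiles, and fails_on_buggy_version are illustrative assumptions, not the paper's actual implementation, and the real Libro post-processing involves more checks than this sketch.

# Minimal sketch of a Libro-style "prompt, then post-process" pipeline.
# The callables query_llm, compiles, and fails_on_buggy_version are
# hypothetical stand-ins for an LLM client and a test-execution harness.
from collections import Counter

PROMPT_TEMPLATE = (
    "Bug report title: {title}\n"
    "Bug report body:\n"
    "{body}\n"
    "\n"
    "Write a JUnit test method that reproduces the bug described above.\n"
)

def generate_candidates(report, query_llm, n_samples=10):
    """Sample several candidate bug-reproducing tests from the LLM."""
    prompt = PROMPT_TEMPLATE.format(title=report["title"], body=report["body"])
    return [query_llm(prompt) for _ in range(n_samples)]

def _normalize(test_source):
    """Collapse whitespace so near-identical generations compare equal."""
    return " ".join(test_source.split())

def select_promising(candidates, compiles, fails_on_buggy_version):
    """Keep candidates that compile and fail on the buggy program version,
    then rank survivors by how often near-identical tests were generated."""
    failing = [t for t in candidates
               if compiles(t) and fails_on_buggy_version(t)]
    votes = Counter(_normalize(t) for t in failing)
    unique = {_normalize(t): t for t in failing}  # deduplicate near-identical tests
    return sorted(unique.values(),
                  key=lambda t: votes[_normalize(t)], reverse=True)

The agreement-based ranking reflects the intuition that a test the model generates many times, and that reproduces the reported failure, is more likely to be a faithful reproduction than a one-off generation.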
Source journal: IEEE Transactions on Software Engineering (Engineering Technology - Engineering: Electronic & Electrical)
CiteScore: 9.70
Self-citation rate: 10.80%
Articles published: 724
Review time: 6 months
About the journal: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.
Latest articles from this journal:
Triple Peak Day: Work Rhythms of Software Developers in Hybrid Work
GenProgJS: a Baseline System for Test-based Automated Repair of JavaScript Programs
On Inter-dataset Code Duplication and Data Leakage in Large Language Models
Line-Level Defect Prediction by Capturing Code Contexts with Graph Convolutional Networks
Does Treatment Adherence Impact Experiment Results in TDD?