Rethinking the Influence of Source Code on Test Case Generation

Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
arXiv:2409.09464 · arXiv - CS - Software Engineering · Published 2024-09-14
Citations: 0

Abstract

Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as the context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug-detection effectiveness. Our evaluation results with five open-source and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs, degrading the correctness, coverage, and bug-revealing power of the generated tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. On the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code yield tests that detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regressions, but on early-stage, immature code it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
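The setup the abstract describes can be sketched in a few lines: a prompt pairs the task description with a (possibly incorrect) reference implementation, and generated tests are scored by whether they expose bugs. The function names, prompt layout, and toy task below are illustrative assumptions, not the paper's actual harness.

```python
# Hedged sketch of the evaluation idea: tests derived from correct code
# catch a bug, while tests that mirror an incorrect implementation "bake
# in" its error and catch nothing. All names here are hypothetical.

def build_prompt(task_description: str, code_under_test: str) -> str:
    """Assemble a test-generation prompt: task text plus code context."""
    return (
        f"Task:\n{task_description}\n\n"
        f"Code under test:\n{code_under_test}\n\n"
        "Write unit tests for this function."
    )

def bug_detection_rate(tests, buggy_impls) -> float:
    """Fraction of buggy implementations failed by at least one test.

    `tests` is a list of (input, expected_output) pairs; an implementation
    is 'detected' if any test's expectation disagrees with its output.
    """
    if not buggy_impls:
        return 0.0
    detected = sum(
        any(impl(x) != expected for x, expected in tests)
        for impl in buggy_impls
    )
    return detected / len(buggy_impls)

# Toy task: absolute value. The buggy version forgets to flip negatives.
buggy = lambda x: x
ok_tests = [(-3, 3), (4, 4)]       # expectations from correct behaviour
misled_tests = [(-3, -3), (4, 4)]  # expectations copied from the bug

print(bug_detection_rate(ok_tests, [buggy]))      # 1.0: bug caught
print(bug_detection_rate(misled_tests, [buggy]))  # 0.0: bug baked in
```

The second call illustrates the paper's central risk: a test suite consistent with incorrect code passes on that code and can never reveal its defect.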