Rethinking the Influence of Source Code on Test Case Generation

Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
arXiv:2409.09464 · arXiv - CS - Software Engineering · Published 2024-09-14
Citations: 0

Abstract

Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as the context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug-detection effectiveness. Our evaluation results with five open-source and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs, degrading the correctness, coverage, and bug-revealing power of the generated tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. On the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code yield tests that detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regressions, but on early-stage, immature code it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
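The setup the abstract describes can be sketched in a few lines: a prompt pairs the task description with a (possibly incorrect) reference implementation, and generated tests are scored by whether they expose bugs. The function names, prompt layout, and toy task below are illustrative assumptions, not the paper's actual harness.

```python
# Hedged sketch of the evaluation idea: tests derived from correct code
# catch a bug, while tests that mirror an incorrect implementation "bake
# in" its error and catch nothing. All names here are hypothetical.

def build_prompt(task_description: str, code_under_test: str) -> str:
    """Assemble a test-generation prompt: task text plus code context."""
    return (
        f"Task:\n{task_description}\n\n"
        f"Code under test:\n{code_under_test}\n\n"
        "Write unit tests for this function."
    )

def bug_detection_rate(tests, buggy_impls) -> float:
    """Fraction of buggy implementations failed by at least one test.

    `tests` is a list of (input, expected_output) pairs; an implementation
    is 'detected' if any test's expectation disagrees with its output.
    """
    if not buggy_impls:
        return 0.0
    detected = sum(
        any(impl(x) != expected for x, expected in tests)
        for impl in buggy_impls
    )
    return detected / len(buggy_impls)

# Toy task: absolute value. The buggy version forgets to flip negatives.
buggy = lambda x: x
ok_tests = [(-3, 3), (4, 4)]       # expectations from correct behaviour
misled_tests = [(-3, -3), (4, 4)]  # expectations copied from the bug

print(bug_detection_rate(ok_tests, [buggy]))      # 1.0: bug caught
print(bug_detection_rate(misled_tests, [buggy]))  # 0.0: bug baked in
```

The second call illustrates the paper's central risk: a test suite consistent with incorrect code passes on that code and can never reveal its defect.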