Rethinking the Influence of Source Code on Test Case Generation
Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui
arXiv - CS - Software Engineering, 2024-09-14, https://doi.org/arxiv-2409.09464
Citations: 0
Abstract
Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection capability. Our evaluation results with five open-source and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly degrade LLMs' ability to generate correct, high-coverage, and bug-revealing tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. On the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code yield tests that detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: applying it to mature code may help protect against future regressions, but applying it to early-stage, immature code may simply bake in existing errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
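To make the described setup more concrete, the sketch below shows (in Python) one plausible way such an evaluation could be wired up: a test-generation prompt is built from a task description plus either correct or incorrect code under test, and the generated assertions are then scored against a correct reference implementation. This is a minimal illustrative sketch, not the paper's actual pipeline; the prompt format, the `build_prompt`, `generate_tests`, and `test_accuracy` names, and the stubbed LLM call are all assumptions.

```python
# Illustrative sketch only: the prompt format, the stubbed LLM call, and the
# scoring function are assumptions, not the paper's released code.

from typing import Callable, List


def build_prompt(task_description: str, code_under_test: str) -> str:
    """Combine the task description with the (possibly incorrect) code under test."""
    return (
        f"Task:\n{task_description}\n\n"
        f"Code under test:\n{code_under_test}\n\n"
        "Write assertion-style test cases for this function."
    )


def generate_tests(prompt: str) -> List[str]:
    """Placeholder for an LLM call. Here we return hand-written assertions so
    the sketch runs end to end; a wrong assertion mimics one induced by buggy context."""
    return [
        "assert add(2, 3) == 5",
        "assert add(-1, 1) == 0",
        "assert add(0, 0) == 1",  # incorrect expectation
    ]


def test_accuracy(assertions: List[str], reference_impl: Callable) -> float:
    """Fraction of generated assertions that pass against the correct implementation."""
    env = {"add": reference_impl}
    passed = 0
    for stmt in assertions:
        try:
            exec(stmt, env)  # run the assertion against the reference solution
            passed += 1
        except AssertionError:
            pass
    return passed / len(assertions) if assertions else 0.0


if __name__ == "__main__":
    correct_add = lambda a, b: a + b
    task = "Implement add(a, b) that returns the sum of two integers."
    buggy_code = "def add(a, b):\n    return a - b  # injected bug"

    prompt = build_prompt(task, buggy_code)
    tests = generate_tests(prompt)
    print(f"Test accuracy vs. reference implementation: {test_accuracy(tests, correct_add):.2%}")
```

Under this kind of setup, "test accuracy" drops when the prompt contains incorrect code because the model tends to encode the buggy behavior in its expected outputs, which is the effect the paper quantifies across datasets and models.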