{"title":"为人工智能生成的代码创建全面的测试很难","authors":"Shreya Singhal, Viraj Kumar","doi":"10.1145/3627217.3627238","DOIUrl":null,"url":null,"abstract":"Before implementing a function, programmers are encouraged to write a suite of test cases that specify its intended behaviour on several inputs. A suite of tests is thorough if any buggy implementation fails at least one of these tests. We posit that as the proportion of code generated by Large Language Models (LLMs) grows, so must the ability of students to create test suites that are thorough enough to detect subtle bugs in such code. Our paper makes two contributions. First, we demonstrate how difficult it can be to create thorough tests for LLM-generated code by evaluating 27 test suites from a public dataset (EvalPlus). Second, by identifying deficiencies in these test suites, we propose strategies for improving the ability of students to develop thorough test suites for LLM-generated code.","PeriodicalId":508655,"journal":{"name":"Proceedings of the 16th Annual ACM India Compute Conference","volume":"2 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Creating Thorough Tests for AI-Generated Code is Hard\",\"authors\":\"Shreya Singhal, Viraj Kumar\",\"doi\":\"10.1145/3627217.3627238\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Before implementing a function, programmers are encouraged to write a suite of test cases that specify its intended behaviour on several inputs. A suite of tests is thorough if any buggy implementation fails at least one of these tests. We posit that as the proportion of code generated by Large Language Models (LLMs) grows, so must the ability of students to create test suites that are thorough enough to detect subtle bugs in such code. Our paper makes two contributions. First, we demonstrate how difficult it can be to create thorough tests for LLM-generated code by evaluating 27 test suites from a public dataset (EvalPlus). Second, by identifying deficiencies in these test suites, we propose strategies for improving the ability of students to develop thorough test suites for LLM-generated code.\",\"PeriodicalId\":508655,\"journal\":{\"name\":\"Proceedings of the 16th Annual ACM India Compute Conference\",\"volume\":\"2 4\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th Annual ACM India Compute Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3627217.3627238\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th Annual ACM India Compute Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3627217.3627238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Creating Thorough Tests for AI-Generated Code is Hard
Before implementing a function, programmers are encouraged to write a suite of test cases that specify its intended behaviour on several inputs. A suite of tests is thorough if any buggy implementation fails at least one of these tests. We posit that as the proportion of code generated by Large Language Models (LLMs) grows, so must the ability of students to create test suites that are thorough enough to detect subtle bugs in such code. Our paper makes two contributions. First, we demonstrate how difficult it can be to create thorough tests for LLM-generated code by evaluating 27 test suites from a public dataset (EvalPlus). Second, by identifying deficiencies in these test suites, we propose strategies for improving the ability of students to develop thorough test suites for LLM-generated code.
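To make the abstract's notion of a thorough test suite concrete, here is a minimal hypothetical sketch (not drawn from the paper or from EvalPlus; the task, function names, and test values are invented for illustration). It shows a plausible-looking suite that a subtly buggy, LLM-style implementation passes in full, and the single edge-case test that makes the suite thorough enough to expose the bug.

```python
# Hypothetical illustration of test-suite thoroughness (not from the paper).
# Task: return the sum of the two smallest values in a list of >= 2 integers.

def sum_two_smallest(nums):
    """Reference (correct) implementation."""
    a, b = sorted(nums)[:2]
    return a + b

def sum_two_smallest_buggy(nums):
    """Plausible LLM-generated variant: deduplicates first, so it is wrong
    whenever the smallest value appears more than once."""
    a, b = sorted(set(nums))[:2]
    return a + b

# A seemingly reasonable suite: every input uses distinct values, so the
# buggy implementation passes all of it -- this suite is NOT thorough.
weak_suite = [([3, 1, 2], 3), ([10, 4, 7, 5], 9), ([-2, 0, 6], -2)]

# One extra edge case (a duplicated minimum) suffices to expose this bug.
strong_suite = weak_suite + [([1, 1, 5], 2)]

for impl in (sum_two_smallest, sum_two_smallest_buggy):
    weak_ok = all(impl(list(xs)) == want for xs, want in weak_suite)
    strong_ok = all(impl(list(xs)) == want for xs, want in strong_suite)
    print(f"{impl.__name__}: weak suite passed={weak_ok}, "
          f"strong suite passed={strong_ok}")
```

Running this prints that both implementations pass the weak suite, while only the reference implementation passes the strong one: the duplicated-minimum test is exactly the kind of subtle edge case the paper argues students must learn to anticipate.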