Test-based and metric-based evaluation of code generation models for practical question answering

Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev
{"title":"基于测试和基于度量的实际问题回答代码生成模型的评估","authors":"Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev","doi":"10.1109/ICCQ57276.2023.10114665","DOIUrl":null,"url":null,"abstract":"We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Test-based and metric-based evaluation of code generation models for practical question answering\",\"authors\":\"Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev\",\"doi\":\"10.1109/ICCQ57276.2023.10114665\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. 
With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.\",\"PeriodicalId\":318687,\"journal\":{\"name\":\"2023 International Conference on Code Quality (ICCQ)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Conference on Code Quality (ICCQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCQ57276.2023.10114665\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Code Quality (ICCQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCQ57276.2023.10114665","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We performed a comparative analysis of code generation model performance, comparing evaluation with common NLP metrics against a test-based evaluation. The investigation was carried out in the context of question answering with code (the text-to-code problem) and was aimed at checking the applicability of both ways of evaluating generated code in a fully automatic manner. We applied the CodeGen and GPTNeo pretrained models to a question-answering problem over a Stack Overflow-based corpus (APIzation). For the test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used to provide automatically generated tests. The analysis showed that performance evaluation based solely on NLP metrics or solely on tests gives a rather limited assessment of generated code quality: predictions with both high and low NLP metrics exist that either pass or fail the tests. With the early results of our empirical study discussed in this paper, we believe that combining both approaches may open additional ways of building, evaluating, and training code generation models.
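To make the two evaluation modes concrete, the sketch below combines a token-level NLP metric with a test-based pass/fail check for a single generated snippet. This is an illustrative Python sketch under stated assumptions, not the authors' pipeline (which targets Java code with tests generated by Machinet/UTBot): the add function, the reference solution, and the assertion are hypothetical, and NLTK's sentence-level BLEU stands in for the metrics used in the paper.

```python
# Illustrative sketch only: combine a metric-based score with a test-based
# pass/fail check for one generated snippet. The snippet, reference, and test
# below are hypothetical; NLTK's sentence-level BLEU is a stand-in metric.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def metric_score(reference_code: str, generated_code: str) -> float:
    """Token-level BLEU between a reference solution and the generated code."""
    smoothing = SmoothingFunction().method1
    return sentence_bleu(
        [reference_code.split()],
        generated_code.split(),
        smoothing_function=smoothing,
    )


def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run the generated code together with an (automatically generated) test."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run assertions against it
        return True
    except Exception:
        return False


# Hypothetical question-answering example: the generated answer behaves
# correctly (the test passes) although its tokens differ from the reference,
# so the NLP metric alone would under-rate it.
reference = "def add(a, b):\n    return a + b"
generated = "def add(x, y):\n    return x + y"
test = "assert add(2, 3) == 5"

print("BLEU:", round(metric_score(reference, generated), 3))
print("Tests pass:", passes_tests(generated, test))
```

Running the snippet illustrates the disagreement the paper reports: the BLEU score is low because identifiers differ, while the test still passes; the reverse case (high metric, failing test) arises when the generated code is textually close to the reference but behaviorally wrong.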