Test-based and metric-based evaluation of code generation models for practical question answering

Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev
{"title":"基于测试和基于度量的实际问题回答代码生成模型的评估","authors":"Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev","doi":"10.1109/ICCQ57276.2023.10114665","DOIUrl":null,"url":null,"abstract":"We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.","PeriodicalId":318687,"journal":{"name":"2023 International Conference on Code Quality (ICCQ)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Test-based and metric-based evaluation of code generation models for practical question answering\",\"authors\":\"Sergey Kovalchuk, Dmitriy Fedrushkov, Vadim Lomshakov, Artem Aliev\",\"doi\":\"10.1109/ICCQ57276.2023.10114665\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. 
With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.\",\"PeriodicalId\":318687,\"journal\":{\"name\":\"2023 International Conference on Code Quality (ICCQ)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Conference on Code Quality (ICCQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCQ57276.2023.10114665\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Code Quality (ICCQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCQ57276.2023.10114665","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We performed a comparative analysis of code generation model performance, comparing evaluation with common NLP metrics against a test-based evaluation. The investigation was carried out in the context of question answering with code (the text-to-code problem) and was aimed at checking the applicability of both ways of evaluating generated code in a fully automatic manner. We applied the CodeGen and GPTNeo pretrained models to a question-answering problem over a Stack Overflow-based corpus (APIzation). For the test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used to provide automatically generated tests. The analysis showed that performance evaluation based solely on NLP metrics or solely on tests gives a rather limited assessment of generated code quality: predictions with both high and low NLP metrics exist that either pass or fail the tests. With the early results of our empirical study discussed in this paper, we believe that combining both approaches may open additional ways of building, evaluating, and training code generation models.
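To make the two evaluation modes concrete, the sketch below combines a token-level NLP metric with a test-based pass/fail check for a single generated snippet. This is an illustrative Python sketch under stated assumptions, not the authors' pipeline (which targets Java code with tests generated by Machinet/UTBot): the add function, the reference solution, and the assertion are hypothetical, and NLTK's sentence-level BLEU stands in for the metrics used in the paper.

```python
# Illustrative sketch only: combine a metric-based score with a test-based
# pass/fail check for one generated snippet. The snippet, reference, and test
# below are hypothetical; NLTK's sentence-level BLEU is a stand-in metric.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def metric_score(reference_code: str, generated_code: str) -> float:
    """Token-level BLEU between a reference solution and the generated code."""
    smoothing = SmoothingFunction().method1
    return sentence_bleu(
        [reference_code.split()],
        generated_code.split(),
        smoothing_function=smoothing,
    )


def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run the generated code together with an (automatically generated) test."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run assertions against it
        return True
    except Exception:
        return False


# Hypothetical question-answering example: the generated answer behaves
# correctly (the test passes) although its tokens differ from the reference,
# so the NLP metric alone would under-rate it.
reference = "def add(a, b):\n    return a + b"
generated = "def add(x, y):\n    return x + y"
test = "assert add(2, 3) == 5"

print("BLEU:", round(metric_score(reference, generated), 3))
print("Tests pass:", passes_tests(generated, test))
```

Running the snippet illustrates the disagreement the paper reports: the BLEU score is low because identifiers differ, while the test still passes; the reverse case (high metric, failing test) arises when the generated code is textually close to the reference but behaviorally wrong.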