Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation

David Gros, Hariharan Sezhiyan, Prem Devanbu, Zhou Yu
Published in: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Publication date: 2020-09-01
DOI: 10.1145/3324884.3416546 (https://doi.org/10.1145/3324884.3416546)
Citations: 45

Abstract

The relationship of comments to code, and in particular the task of generating useful comments given the code, has long been of interest. The earliest approaches were based on strong syntactic theories of comment structures and relied on textual templates. More recently, researchers have applied deep-learning methods to this task, specifically trainable generative translation models, which are known to work very well for natural language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CODENN, DEEPCOM, FUNCOM, and Docstring. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality) using “affinity pairs” of methods: from different projects, in the same project, in the same class, etc. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.
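To make the idea of a naive IR baseline scored with BLEU concrete, here is a minimal sketch. It is not the paper's implementation: the tokenizer, the Jaccard similarity, the add-one-smoothed two-gram BLEU variant, and all function names (`retrieve_comment`, `simple_bleu`, and so on) are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Token-set overlap between two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_comment(query_code, corpus):
    """Return the comment attached to the corpus method most similar to query_code.

    corpus is a list of (code, comment) pairs; similarity is plain Jaccard
    overlap, an assumed stand-in for whatever retrieval method one prefers.
    """
    query = tokenize(query_code)
    _, best_comment = max(corpus, key=lambda pair: jaccard(query, tokenize(pair[0])))
    return best_comment


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: clipped n-gram precision up to max_n,
    add-one smoothing, and a brevity penalty. An illustrative variant only."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


# Toy corpus (made-up examples, not drawn from any of the datasets named above).
corpus = [
    ("int add ( int a , int b ) { return a + b ; }", "adds two integers"),
    ("void close ( ) { stream . close ( ) ; }", "closes the underlying stream"),
]
predicted = retrieve_comment("int sum ( int x , int y ) { return x + y ; }", corpus)
score = simple_bleu(predicted, "adds two integers")
```

The same `simple_bleu` function could be applied to the comments of two methods taken from the same class or the same project, which is one way to picture scoring an "affinity pair"; how the paper actually selects and scores such pairs is not specified in this abstract.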