Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation

2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). Pub Date: 2020-09-01. DOI: 10.1145/3324884.3416546

David Gros, Hariharan Sezhiyan, Prem Devanbu, Zhou Yu


Citations: 45

Abstract

The relationship of comments to code, and in particular the task of generating useful comments given the code, has long been of interest. The earliest approaches were based on strong syntactic theories of comment structures and relied on textual templates. More recently, researchers have applied deep-learning methods to this task, specifically trainable generative translation models, which are known to work very well for natural language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CODENN, DEEPCOM, FUNCOM, and Docstring. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality) using “affinity pairs” of methods: from different projects, in the same project, in the same class, etc. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.
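To make the idea of a naive IR baseline scored with BLEU concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, Jaccard token overlap as the retrieval similarity, and a simplified, smoothed sentence-level BLEU against a single reference; the function names (`retrieve_comment`, `sentence_bleu`) and the toy corpus are hypothetical and chosen only for illustration. The paper's actual retrieval method and BLEU variant may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve_comment(query_code, corpus):
    """Naive IR baseline: return the comment attached to the most similar code.

    `corpus` is a list of (code, comment) pairs standing in for a training set.
    """
    query_tokens = tokenize(query_code)
    _, best_comment = max(
        corpus, key=lambda pair: jaccard(query_tokens, tokenize(pair[0]))
    )
    return best_comment


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with one reference.

    Assumption: add-one smoothing for n > 1, so short comments do not
    collapse to zero; this is one of several smoothing choices in use.
    """
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if n == 1:
            if overlap == 0:
                return 0.0
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty for candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


# Toy usage: retrieve a comment for unseen code, then score it.
corpus = [
    ("def add(a, b): return a + b", "add two numbers"),
    ("def read_file(path): return open(path).read()", "read a file from disk"),
]
predicted = retrieve_comment("def add_ints(x, y): return x + y", corpus)
print(predicted, sentence_bleu(predicted, "add two integers"))
```

The same scoring function could in principle be applied to the comments of two related methods rather than to a prediction and its reference, which is the spirit of using pairs of methods as a calibration point for BLEU; how such pairs are selected is not specified here and would be an additional assumption.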


