Context-aware Retrieval-based Deep Commit Message Generation

ACM Transactions on Software Engineering and Methodology (TOSEM) Pub Date : 2021-07-01 DOI:10.1145/3464689

Haoye Wang, Xin Xia, D. Lo, Qiang He, Xinyu Wang, J. Grundy

{"title":"Context-aware Retrieval-based Deep Commit Message Generation","authors":"Haoye Wang, Xin Xia, D. Lo, Qiang He, Xinyu Wang, J. Grundy","doi":"10.1145/3464689","DOIUrl":null,"url":null,"abstract":"Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies make use of neural machine translation algorithms to try and translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias issues, which leads to a gap between training phase and testing phase. In this article, we propose CoRec to address the above two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieval diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories in Github. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.","PeriodicalId":7398,"journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"106 1","pages":"1 - 30"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Software Engineering and Methodology (TOSEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3464689","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 23

Abstract

Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies make use of neural machine translation algorithms to try and translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias issues, which leads to a gap between training phase and testing phase. In this article, we propose CoRec to address the above two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieval diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories in Github. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于上下文感知检索的深度提交消息生成

版本控制系统中记录的提交消息包含对软件开发、维护和理解有价值的信息。不幸的是，开发人员经常使用空的或低质量的提交消息提交代码。为了解决这个问题，一些研究提出了从提交差异生成提交消息的方法。最近的研究利用神经机器翻译算法尝试将git差异翻译成提交消息，并取得了一些有希望的结果。然而，这些基于学习的方法倾向于生成高频词而忽略低频词。此外，它们还存在暴露偏差问题，这导致了训练阶段和测试阶段之间的差距。在本文中，我们提出CoRec来解决上述两个限制。具体来说，我们首先训练一个上下文感知的编码器-解码器模型，该模型随机选择解码器的先前输出或基础真值词的嵌入向量作为上下文，使模型逐渐意识到先前的对齐选择。给定一个用于测试的diff，将重用训练好的模型以从训练集中检索最相似的diff。最后，我们使用检索难度来指导最终生成词汇的概率分布。我们的方法结合了信息检索和神经机器翻译的优点。我们在Liu等人的数据集和从Github上的10K流行Java存储库抓取的大型数据集上评估CoRec。我们的实验结果表明，CoRec在BLEU方面显著优于最先进的NNGen方法，平均高出19%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Software Engineering and Methodology (TOSEM)

自引率

0.00%

发文量

期刊最新文献

Turnover of Companies in OpenStack: Prevalence and Rationale Super-optimization of Smart Contracts Verification of Programs Sensitive to Heap Layout Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones via Deep Learning Guaranteeing Timed Opacity using Parametric Timed Model Checking