Semantic Similarity Metrics for Evaluating Source Code Summarization

Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, D. Parkes
{"title":"Semantic Similarity Metrics for Evaluating Source Code Summarization","authors":"Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, D. Parkes","doi":"10.1145/NNNNNNN.NNNNNNN","DOIUrl":null,"url":null,"abstract":"Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train an e.g. encoder-decoder neural model. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that calculated similarity may not match the perceived similarity by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate to human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for evaluation of source code summarization.","PeriodicalId":426634,"journal":{"name":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"77","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/NNNNNNN.NNNNNNN","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 77

Abstract

Source code summarization involves creating brief natural-language descriptions of source code. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have for programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost entirely based on machine models trained on big data: large datasets of example code and summaries of that code are used to train a neural model, e.g., an encoder-decoder. The model's output predictions are then evaluated against a set of reference summaries: the input is code the model has not seen, and each prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with word overlap is that not all words in a sentence have the same importance, and many words have synonyms. As a result, the calculated similarity may not match the similarity perceived by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate with human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for the evaluation of source code summarization.
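To make the word-overlap problem concrete, here is a minimal, self-contained sketch (illustrative only, not the paper's evaluation code) that scores a predicted summary against a reference by counting shared unigrams, in the spirit of BLEU-1 precision and ROUGE-1 recall. A synonym-swapped paraphrase with the same meaning loses half its score, while a prediction that inverts the meaning by changing one word still scores highly.

```python
# Illustrative sketch: why word-overlap metrics can diverge from
# human-perceived similarity. Counts shared words between a predicted
# summary and a reference, yielding a BLEU-1-style precision and a
# ROUGE-1-style recall.
from collections import Counter

def overlap(prediction, reference):
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    shared = sum((pred & ref).values())          # clipped word-match count
    precision = shared / max(sum(pred.values()), 1)  # BLEU-1-style
    recall = shared / max(sum(ref.values()), 1)      # ROUGE-1-style
    return precision, recall

reference = "returns the maximum value stored in the list"

# Same meaning, different words: scores drop to 0.5 / 0.5.
print(overlap("gets the largest element held in the list", reference))

# Opposite meaning, heavy word overlap: scores stay at 0.875 / 0.875.
print(overlap("returns the minimum value stored in the list", reference))
```

The alternatives discussed in the abstract come from work on semantic similarity, where sentences are compared in an embedding space rather than by surface tokens. A hedged sketch of that idea using the sentence-transformers library follows; the model name is an arbitrary illustrative choice, not necessarily one the paper evaluates.

```python
# Sketch of a semantic alternative: cosine similarity between sentence
# embeddings. Requires `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
emb = model.encode([
    "returns the maximum value stored in the list",  # reference
    "gets the largest element held in the list",     # paraphrase, few shared words
])
# The paraphrase receives a high similarity score despite low word overlap.
print(util.cos_sim(emb[0], emb[1]).item())
```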