Semantic Similarity Metrics for Evaluating Source Code Summarization

Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, D. Parkes
{"title":"Semantic Similarity Metrics for Evaluating Source Code Summarization","authors":"Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, D. Parkes","doi":"10.1145/NNNNNNN.NNNNNNN","DOIUrl":null,"url":null,"abstract":"Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train an e.g. encoder-decoder neural model. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that calculated similarity may not match the perceived similarity by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate to human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for evaluation of source code summarization.","PeriodicalId":426634,"journal":{"name":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"77","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/NNNNNNN.NNNNNNN","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 77

Abstract

Source code summarization involves creating brief natural-language descriptions of source code. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have for programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost entirely based on machine models trained on big data: large datasets of example code and summaries of that code are used to train a neural model, e.g., an encoder-decoder. The model's output predictions are then evaluated against a set of reference summaries: the input is code the model has not seen, and each prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with word overlap is that not all words in a sentence have the same importance, and many words have synonyms. As a result, the calculated similarity may not match the similarity perceived by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate with human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for the evaluation of source code summarization.
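To make the word-overlap problem concrete, here is a minimal, self-contained sketch (illustrative only, not the paper's evaluation code) that scores a predicted summary against a reference by counting shared unigrams, in the spirit of BLEU-1 precision and ROUGE-1 recall. A synonym-swapped paraphrase with the same meaning loses half its score, while a prediction that inverts the meaning by changing one word still scores highly.

```python
# Illustrative sketch: why word-overlap metrics can diverge from
# human-perceived similarity. Counts shared words between a predicted
# summary and a reference, yielding a BLEU-1-style precision and a
# ROUGE-1-style recall.
from collections import Counter

def overlap(prediction, reference):
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    shared = sum((pred & ref).values())          # clipped word-match count
    precision = shared / max(sum(pred.values()), 1)  # BLEU-1-style
    recall = shared / max(sum(ref.values()), 1)      # ROUGE-1-style
    return precision, recall

reference = "returns the maximum value stored in the list"

# Same meaning, different words: scores drop to 0.5 / 0.5.
print(overlap("gets the largest element held in the list", reference))

# Opposite meaning, heavy word overlap: scores stay at 0.875 / 0.875.
print(overlap("returns the minimum value stored in the list", reference))
```

The alternatives discussed in the abstract come from work on semantic similarity, where sentences are compared in an embedding space rather than by surface tokens. A hedged sketch of that idea using the sentence-transformers library follows; the model name is an arbitrary illustrative choice, not necessarily one the paper evaluates.

```python
# Sketch of a semantic alternative: cosine similarity between sentence
# embeddings. Requires `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
emb = model.encode([
    "returns the maximum value stored in the list",  # reference
    "gets the largest element held in the list",     # paraphrase, few shared words
])
# The paraphrase receives a high similarity score despite low word overlap.
print(util.cos_sim(emb[0], emb[1]).item())
```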