Semantic similarity loss for neural source code summarization

Journal of Software-Evolution and Process (IF 1.7; Zone 4, Computer Science; JCR Q3, Computer Science, Software Engineering). Pub Date: 2024-07-07. DOI: 10.1002/smr.2706
Chia-Yi Su, Collin McMillan
Volume 36, Issue 11. Full text: https://onlinelibrary.wiley.com/doi/10.1002/smr.2706
Citations: 0

Abstract

This paper presents and evaluates a procedure for using a semantic similarity metric as a loss function for neural source code summarization. Code summarization is the task of writing natural language descriptions of source code. Neural code summarization refers to automated techniques for generating these descriptions using neural networks. Almost all current approaches use neural networks either as standalone models or as part of pretrained large language models, for example, GPT, Codex, and LLaMA. Yet almost all also use a categorical cross-entropy (CCE) loss function for network optimization. Two problems with CCE are that (1) it computes loss over each word prediction one at a time, rather than evaluating a whole sentence, and (2) it requires a perfect prediction, leaving no room for partial credit for synonyms. In this paper, we extend our previous work on semantic similarity metrics to show a procedure for using semantic similarity as a loss function to alleviate these problems, and we evaluate this procedure in several settings in both metrics-driven and human studies. In essence, we propose to use a semantic similarity metric to calculate loss over the whole output sentence prediction per training batch, rather than only the loss for each word. We also propose to combine our loss with CCE for each word, which streamlines the training process compared to baselines. We evaluate our approach over several baselines and report improvement in the vast majority of conditions.
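The two CCE problems named in the abstract, and the idea of adding a sentence-level term, can be illustrated with a small sketch. The abstract does not say which similarity metric is used or how it is combined with CCE, so everything specific here is an assumption made for illustration: the toy word vectors, cosine similarity over averaged vectors as the "semantic similarity metric", the weighted sum, and all names (`TOY_VECTORS`, `token_cce`, `similarity_loss`, `combined_loss`, `alpha`). It is not the authors' implementation, and it only scores fixed predictions rather than training a network.

```python
import math

# Hypothetical toy word vectors (an assumption for illustration only);
# a real system would use a learned sentence encoder instead.
TOY_VECTORS = {
    "returns": [1.0, 0.0, 0.0],
    "gets": [0.9, 0.1, 0.0],   # near-synonym of "returns"
    "closes": [-1.0, 0.0, 0.0],  # unrelated meaning
    "the": [0.0, 1.0, 0.0],
    "name": [0.0, 0.0, 1.0],
}


def token_cce(probs, reference):
    """Mean categorical cross-entropy: -log p(reference word) per position."""
    return sum(-math.log(p[w]) for p, w in zip(probs, reference)) / len(reference)


def sentence_vector(words):
    """Average the word vectors of a sentence (stand-in for a sentence encoder)."""
    dims = len(TOY_VECTORS["the"])
    return [sum(TOY_VECTORS[w][d] for w in words) / len(words) for d in range(dims)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_loss(predicted, reference):
    """Whole-sentence loss: 1 - cosine similarity of the two sentence vectors."""
    return 1.0 - cosine(sentence_vector(predicted), sentence_vector(reference))


def combined_loss(probs, predicted, reference, alpha=0.5):
    """Assumed combination: weighted sum of per-word CCE and sentence-level loss."""
    return alpha * token_cce(probs, reference) + (1.0 - alpha) * similarity_loss(
        predicted, reference
    )


reference = ["returns", "the", "name"]

# Two predictions that both give the reference word "returns" probability 0.1,
# so per-word CCE cannot tell them apart.
synonym_pred = ["gets", "the", "name"]
synonym_probs = [{"returns": 0.1, "gets": 0.9}, {"the": 1.0}, {"name": 1.0}]

wrong_pred = ["closes", "the", "name"]
wrong_probs = [{"returns": 0.1, "closes": 0.9}, {"the": 1.0}, {"name": 1.0}]

print("CCE:     ", token_cce(synonym_probs, reference), token_cce(wrong_probs, reference))
print("Combined:", combined_loss(synonym_probs, synonym_pred, reference),
      combined_loss(wrong_probs, wrong_pred, reference))
```

In this sketch the per-word CCE is identical for the synonym and the unrelated word, while the combined loss is lower for the synonym, which is the "partial credit" the abstract says CCE lacks. Using this as a training signal would additionally require a differentiable path from the model outputs to the similarity score, which the sketch does not attempt.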
