Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization

IEEE Transactions on Software Engineering · IF 6.5 · CAS Tier 1 (Computer Science) · JCR Q1, Computer Science, Software Engineering · Vol. 50, No. 8, pp. 2077-2095 · Pub Date: 2024-07-03 · DOI: 10.1109/TSE.2024.3422274
Chunrong Fang;Weisong Sun;Yuchen Chen;Xiao Chen;Zhao Wei;Quanjun Zhang;Yudu You;Bin Luo;Yang Liu;Zhenyu Chen
{"title":"Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization","authors":"Chunrong Fang;Weisong Sun;Yuchen Chen;Xiao Chen;Zhao Wei;Quanjun Zhang;Yudu You;Bin Luo;Yang Liu;Zhenyu Chen","doi":"10.1109/TSE.2024.3422274","DOIUrl":null,"url":null,"abstract":"(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in promoting developers to understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors, and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder) are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on the code summarization task. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing specific features that need to be summarized. In a nutshell, they fail to learn the alignment between code snippets and summaries (code-summary alignment for short). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment, including unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets would help learn the code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets. We evaluate the effectiveness of our approach, called \n<sc>Esale</small>\n, by conducting extensive experiments on four datasets, including two widely used datasets JCSD and PCSD, a cross-project Java dataset CPJD, and a multilingual language dataset CodeSearchNet. Experimental results show that \n<sc>Esale</small>\n significantly outperforms state-of-the-art baselines in all three widely used metrics, including BLEU, METEOR, and ROUGE-L. Moreover, the human evaluation proves that the summaries generated by \n<sc>Esale</small>\n are more informative and closer to the ground-truth summaries.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 8","pages":"2077-2095"},"PeriodicalIF":6.5000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10584357/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors and the decoder decodes those context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder), equipped with encoders capable of producing general context vectors, have achieved substantial improvements on the code summarization task. However, although these models are trained mainly on code-focused tasks and capture general code features, they still fall short in capturing the specific features that need to be summarized. In a nutshell, they fail to learn the alignment between code snippets and summaries (code-summary alignment for short). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks, namely unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP), to enhance its ability to learn code-summary alignment. Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting summary words based on given code snippets helps the model learn code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the encoder's ability to learn the alignment between action words and code snippets. We evaluate the effectiveness of our approach, called Esale, by conducting extensive experiments on four datasets: the two widely used datasets JCSD and PCSD, the cross-project Java dataset CPJD, and the multilingual dataset CodeSearchNet. Experimental results show that Esale significantly outperforms state-of-the-art baselines on all three widely used metrics, namely BLEU, METEOR, and ROUGE-L. Moreover, a human evaluation confirms that the summaries generated by Esale are more informative and closer to the ground-truth summaries.
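As a rough illustration of the multi-task objective described in the abstract, the sketch below combines the three summary-focused losses (ULM, MLM, and AWP) on top of a shared encoder. This is a minimal sketch, not the authors' implementation: the encoder architecture, the small hyperparameters, reading the action word off the first position, and the equal loss weighting are all illustrative assumptions, and the separate causal attention mask a real ULM head would need is omitted for brevity.

```python
# Minimal sketch of a joint ULM + MLM + AWP objective over a shared encoder.
# All names, sizes, and the equal loss weights are illustrative assumptions;
# positions that should not contribute to a loss carry the label -100.
import torch
import torch.nn as nn


class SummaryFocusedEncoder(nn.Module):
    def __init__(self, vocab_size=50000, d_model=256, n_action_words=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # ULM/MLM heads predict (masked) summary words conditioned on the code;
        # the AWP head classifies the summary's action word (e.g., "return", "check").
        self.ulm_head = nn.Linear(d_model, vocab_size)
        self.mlm_head = nn.Linear(d_model, vocab_size)
        self.awp_head = nn.Linear(d_model, n_action_words)

    def forward(self, tokens):
        # `tokens` is the concatenated code snippet and (partially masked) summary.
        return self.encoder(self.embed(tokens))


def multi_task_loss(model, tokens, ulm_labels, mlm_labels, awp_label):
    hidden = model(tokens)                      # (batch, seq_len, d_model)
    token_ce = nn.CrossEntropyLoss(ignore_index=-100)
    # Word-level losses over summary positions only (code positions are -100).
    ulm_loss = token_ce(model.ulm_head(hidden).transpose(1, 2), ulm_labels)
    mlm_loss = token_ce(model.mlm_head(hidden).transpose(1, 2), mlm_labels)
    # AWP: classify the action word from the first position's representation.
    awp_loss = nn.CrossEntropyLoss()(model.awp_head(hidden[:, 0, :]), awp_label)
    return ulm_loss + mlm_loss + awp_loss       # equal weighting is an assumption


# Toy usage with random data, just to show the expected tensor shapes.
model = SummaryFocusedEncoder()
tokens = torch.randint(0, 50000, (2, 32))                 # 2 sequences of 32 token ids
ulm_labels = torch.full((2, 32), -100)
mlm_labels = torch.full((2, 32), -100)
ulm_labels[:, 20:] = torch.randint(0, 50000, (2, 12))     # summary tail to predict
mlm_labels[:, 18:22] = torch.randint(0, 50000, (2, 4))    # a few masked summary words
awp_label = torch.randint(0, 100, (2,))
loss = multi_task_loss(model, tokens, ulm_labels, mlm_labels, awp_label)
loss.backward()
```

In a real system the encoder would presumably be initialized from a pre-trained code model and the loss weights tuned; the sketch only shows how the three summary-focused signals can share one encoder.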
Source Journal

IEEE Transactions on Software Engineering
Category: Engineering & Technology – Engineering: Electrical & Electronic
CiteScore: 9.70
Self-citation rate: 10.80%
Articles published: 724
Review time: 6 months
Journal Description
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.
Latest Articles in This Journal
GenProgJS: a Baseline System for Test-based Automated Repair of JavaScript Programs
On Inter-dataset Code Duplication and Data Leakage in Large Language Models
Line-Level Defect Prediction by Capturing Code Contexts with Graph Convolutional Networks
Does Treatment Adherence Impact Experiment Results in TDD?
Scoping Software Engineering for AI: The TSE Perspective