Multi-task Learning based Pre-trained Language Model for Code Completion

2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). Pub Date: 2020-09-01. DOI: 10.1145/3324884.3416591

F. Liu, Ge Li, Yunfei Zhao, Zhi Jin


Citations: 126

Abstract

Code completion is one of the most useful features of Integrated Development Environments (IDEs): it accelerates software development by suggesting the next probable token in real time, based on the contextual code. Recent studies have shown that statistical language modeling techniques can improve the performance of code completion tools by learning from large-scale software repositories. However, these models suffer from two major drawbacks: a) Existing research uses static embeddings, which map a word to the same vector regardless of its context. The differences in the meaning of a token across contexts are lost when each token is associated with a single representation; b) Existing language-model-based code completion models perform poorly on completing identifiers, and most of them ignore the type information of the identifiers. To address these challenges, in this paper we develop a multi-task learning based pre-trained language model for code understanding and code generation with a Transformer-based neural architecture. We pre-train it with hybrid objective functions that incorporate both code understanding and code generation tasks. Then we fine-tune the pre-trained model on code completion. During completion, our model does not directly predict the next token. Instead, we adopt multi-task learning to predict the token and its type jointly, and use the predicted type to assist the token prediction. Experimental results on two real-world datasets demonstrate the effectiveness of our model when compared with state-of-the-art methods.
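The joint token-and-type prediction described in the abstract can be illustrated with a toy sketch. This is not the authors' Transformer model: the single linear layers, the choice to feed the predicted type distribution into the token classifier by concatenation, the `type_weight` loss coefficient, and all names and numbers below are assumptions made purely for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, vector):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def joint_predict(hidden, type_w, type_b, token_w, token_b):
    """Predict the type first, then let it inform the token prediction."""
    # Task 1: distribution over identifier types from the context vector.
    type_probs = softmax(linear(type_w, type_b, hidden))
    # Task 2: the token classifier sees the context vector plus the
    # predicted type distribution (concatenation is an assumption here).
    token_probs = softmax(linear(token_w, token_b, hidden + type_probs))
    return type_probs, token_probs

def joint_loss(type_probs, token_probs, type_target, token_target,
               type_weight=0.5):
    """Multi-task objective: token cross-entropy plus weighted type cross-entropy."""
    token_ce = -math.log(token_probs[token_target])
    type_ce = -math.log(type_probs[type_target])
    return token_ce + type_weight * type_ce

# Hypothetical toy setup: 3-dim context vector, 2 types, 4 candidate tokens.
hidden = [0.2, -0.1, 0.4]
type_w = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]
type_b = [0.0, 0.1]
token_w = [[0.1, 0.2, -0.1, 0.6, -0.4],
           [-0.3, 0.1, 0.2, -0.5, 0.7],
           [0.4, -0.2, 0.3, 0.2, 0.1],
           [0.0, 0.5, -0.4, -0.1, 0.3]]
token_b = [0.0, 0.0, 0.1, -0.1]

type_probs, token_probs = joint_predict(hidden, type_w, type_b, token_w, token_b)
loss = joint_loss(type_probs, token_probs, type_target=1, token_target=2)
```

In a real system `hidden` would come from the pre-trained contextual encoder, and both heads would be trained together so that gradients from the type task also shape the shared representation.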

