Zero-Shot Program Representation Learning

Nan Cui, Yuze Jiang, Xiaodong Gu, Beijun Shen
{"title":"Zero-Shot Program Representation Learning","authors":"Nan Cui, Yuze Jiang, Xiaodong Gu, Beijun Shen","doi":"10.1145/3524610.3527888","DOIUrl":null,"url":null,"abstract":"Learning program representations has been the core prerequisite of code intelligence tasks (e.g., code search and code clone detection). The state-of-the-art pre-trained models such as CodeBERT require the availability of large-scale code corpora. However, gathering training samples can be costly and infeasible for domain-specific languages such as Solidity for smart contracts. In this paper, we propose Zecoler, a zero-shot learning approach for code representations. Zecoler is built upon a pre-trained programming language model. In order to elicit knowledge from the pre-trained models efficiently, Zecoler casts the downstream tasks to the same form of pre-training tasks by inserting trainable prompts into the original input. Then, it employs the prompt learning technique to optimize the pre-trained model by merely adjusting the original input. This enables the representation model to efficiently fit the scarce task-specific data while reusing pre-trained knowledge. We evaluate Zecoler in three code intelligence tasks in two programming languages that have no training samples, namely, Solidity and Go, with model trained in corpora of common languages such as Java. Experimental results show that our approach significantly outperforms baseline models in both zero-shot and few-shot settings.","PeriodicalId":426634,"journal":{"name":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3524610.3527888","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Learning program representations has been a core prerequisite for code intelligence tasks (e.g., code search and code clone detection). State-of-the-art pre-trained models such as CodeBERT require large-scale code corpora. However, gathering training samples can be costly or infeasible for domain-specific languages such as Solidity for smart contracts. In this paper, we propose Zecoler, a zero-shot learning approach for code representations. Zecoler is built upon a pre-trained programming language model. To elicit knowledge from the pre-trained model efficiently, Zecoler casts downstream tasks into the same form as the pre-training tasks by inserting trainable prompts into the original input. It then employs prompt learning to optimize the pre-trained model by adjusting only the original input. This enables the representation model to fit scarce task-specific data efficiently while reusing pre-trained knowledge. We evaluate Zecoler on three code intelligence tasks in two programming languages that have no training samples, namely Solidity and Go, with the model trained on corpora of common languages such as Java. Experimental results show that our approach significantly outperforms baseline models in both zero-shot and few-shot settings.
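The core mechanism the abstract describes, inserting trainable prompts into the original input so that a frozen pre-trained model can fit scarce task-specific data, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the microsoft/codebert-base checkpoint is a real model, but NUM_PROMPTS, the classification head, and reading the prediction from the first prompt position are assumptions made for the sketch.

```python
# A minimal sketch of prompt-based tuning in the spirit of Zecoler
# (illustrative, not the paper's actual code). The pre-trained encoder
# is frozen; only the continuous prompt vectors and a small task head
# receive gradients, so the original input is all that is "adjusted".
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

NUM_PROMPTS = 10  # number of trainable prompt tokens (illustrative choice)

class PromptClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("microsoft/codebert-base")
        for p in self.encoder.parameters():   # freeze the pre-trained model;
            p.requires_grad = False           # only prompts and head are tuned
        hidden = self.encoder.config.hidden_size
        # Trainable continuous prompts inserted into the original input.
        self.prompts = nn.Parameter(torch.randn(NUM_PROMPTS, hidden) * 0.02)
        self.head = nn.Linear(hidden, num_labels)  # task head (assumption)

    def forward(self, input_ids, attention_mask):
        # Embed the code tokens, then prepend the prompt vectors.
        tok_embeds = self.encoder.embeddings.word_embeddings(input_ids)
        bsz = input_ids.size(0)
        prompts = self.prompts.unsqueeze(0).expand(bsz, -1, -1)
        inputs_embeds = torch.cat([prompts, tok_embeds], dim=1)
        prompt_mask = torch.ones(bsz, NUM_PROMPTS,
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        out = self.encoder(inputs_embeds=inputs_embeds, attention_mask=mask)
        # Classify from the first prompt position's representation.
        return self.head(out.last_hidden_state[:, 0])

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
batch = tokenizer(["function transfer(address to) public {}"],
                  return_tensors="pt", truncation=True)
model = PromptClassifier()
logits = model(batch["input_ids"], batch["attention_mask"])  # shape (1, 2)
```

Because the encoder stays frozen, only NUM_PROMPTS × hidden_size prompt parameters plus the head are optimized, which is what lets such a model be tuned on the handful of (or zero) Solidity/Go samples the paper targets after training the prompts on a resource-rich language such as Java.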