VarGAN: Adversarial Learning of Variable Semantic Representations

IF 5.6 1区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING IEEE Transactions on Software Engineering Pub Date : 2024-04-25 DOI:10.1109/TSE.2024.3391730

Yalan Lin;Chengcheng Wan;Shuwen Bai;Xiaodong Gu

{"title":"VarGAN: Adversarial Learning of Variable Semantic Representations","authors":"Yalan Lin;Chengcheng Wan;Shuwen Bai;Xiaodong Gu","doi":"10.1109/TSE.2024.3391730","DOIUrl":null,"url":null,"abstract":"Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer has addressed the surface-level recognition of low-frequency tokens, it has not noticed the inadequate training of low-frequency identifiers by code representation models, resulting in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN empowers CodeBERT to generate code vectors that exhibit more uniform distribution for both low- and high-frequency identifiers. There is an improvement of 8% in similarity and relatedness scores compared to VarCLR in the IdBench benchmark. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"50 6","pages":"1505-1517"},"PeriodicalIF":5.6000,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10508714/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer has addressed the surface-level recognition of low-frequency tokens, it has not noticed the inadequate training of low-frequency identifiers by code representation models, resulting in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN empowers CodeBERT to generate code vectors that exhibit more uniform distribution for both low- and high-frequency identifiers. There is an improvement of 8% in similarity and relatedness scores compared to VarCLR in the IdBench benchmark. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

VarGAN：可变语义表征的对抗学习

变量名在代码表示学习中至关重要。然而，由于命名规则的多样性，变量通常会被任意命名，从而导致长尾、词汇量不足（OOV）和其他众所周知的问题。虽然字节对编码（BPE）标记器解决了低频标记的表面识别问题，但它并没有注意到代码表示模型对低频标识符的训练不足，导致稀有标识符和常见标识符的分布不平衡。因此，代码表示模型很难有效捕捉低频变量名的语义。在本文中，我们提出了一种用于变量名表示的新方法 VarGAN。VarGAN 通过对抗训练加强了低频变量的训练。具体来说，我们将代码表示模型视为一个生成器，负责从源代码中生成向量。此外，我们还采用了一种判别器，用于检测输入到生成器的代码是否包含低频变量。这种对抗设置规范了稀有变量的分布，使它们与向量空间中相应的高频变量重叠。实验结果表明，VarGAN 使 CodeBERT 生成的代码向量在低频和高频标识符的分布上更加均匀。在 IdBench 基准测试中，与 VarCLR 相比，相似性和相关性得分提高了 8%。VarGAN 在下游任务中也得到了验证，在捕获标记和代码级语义方面表现出更强的能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Transactions on Software Engineering 工程技术-工程：电子与电气

CiteScore

9.70

自引率

10.80%

发文量

724

审稿时长

6 months

期刊介绍： IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.