Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification

2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC) Pub Date: 2022-05-01 DOI: 10.1145/3524610.3527915

Kesu Wang, Meng Yan, He Zhang, Haibo Hu


Citations: 5

Abstract

Program classification can be regarded as a high-level abstraction of code. It lays a foundation for many source code comprehension tasks and has a wide range of applications in software engineering, such as code clone detection, code smell classification, and defect classification. Cross-language program classification can support code transfer between programming languages and promote cross-language code reuse, thereby helping developers write code quickly and reducing the development time of code transfer. Most existing studies focus on learning the semantics of code, whilst few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features from different programming languages. To address this challenge, we propose a Unified Abstract Syntax Tree (UAST) neural network. The core idea of UAST consists of two unification mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and the graph-like AST structure to capture semantic code features. Second, we construct a mechanism called unified vocabulary, which reduces the feature gap between different programming languages and thereby enables cross-language program classification. In addition, we collect a dataset containing 20,000 files in five programming languages, which can be used as a benchmark for the cross-language program classification task. We conduct experiments on two datasets, and the results show that our approach outperforms the state-of-the-art baselines on four evaluation metrics (Precision, Recall, F1-score, and Accuracy).
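The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the traversal sequence, the graph structure, or the unified vocabulary are built. The sketch assumes a preorder traversal, a parent-to-child edge list as the graph view, and a small hand-written mapping (`UNIFIED_VOCAB`, hypothetical) from language-specific node type names to shared tokens. It uses Python's standard `ast` module as the only parser; the Java-style node names in the mapping are illustrative placeholders.

```python
import ast

# Hypothetical unified vocabulary: language-specific AST node type names
# mapped to shared tokens. The entries are illustrative assumptions; the
# abstract does not list the actual vocabulary.
UNIFIED_VOCAB = {
    "FunctionDef": "function",        # Python node name
    "MethodDeclaration": "function",  # placeholder for a Java-style node name
    "For": "loop",
    "While": "loop",
    "ForStatement": "loop",
    "If": "branch",
    "IfStatement": "branch",
    "Return": "return",
    "ReturnStatement": "return",
}


def unify(node_type):
    """Map a node type name to its shared token, or keep it unchanged."""
    return UNIFIED_VOCAB.get(node_type, node_type)


def ast_views(source):
    """Return two views of the AST of a Python source string.

    sequence: unified node tokens in preorder (assumed traversal order).
    edges:    (parent_index, child_index) pairs into `sequence`,
              a simple stand-in for the graph-like AST structure.
    """
    tree = ast.parse(source)
    sequence = []
    edges = []

    def visit(node, parent_index):
        index = len(sequence)
        sequence.append(unify(type(node).__name__))
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(tree, None)
    return sequence, edges


sequence, edges = ast_views(
    "def f(n):\n"
    "    while n:\n"
    "        n -= 1\n"
    "    return n\n"
)
print(sequence[:4])
print(edges[:4])
```

In this sketch, a sequence model could consume `sequence` and a graph model could consume `edges`; because both views share the same unified tokens, programs parsed from different languages would land in one token space. How UAST actually combines the two views is described in the paper itself, not in the abstract.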


