Learning to Represent Programs with Heterogeneous Graphs

2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)
Pub Date: 2020-12-08
DOI: 10.1145/3524610.3527905

Wenhan Wang, Kechi Zhang, Ge Li, Zhi Jin

Citations: 34

Abstract

Code representation, which transforms programs into vectors that capture their semantics, is essential for source code processing. In recent years, incorporating structural information (i.e., graphs) into code representations has proven effective. Specifically, the abstract syntax tree (AST) and the AST-augmented graph of a program contain rich structural and semantic information, and most existing studies use them for code representation. However, the graphs adopted by existing approaches are homogeneous, i.e., they discard the type information of the nodes and edges within the AST, which may hinder the representation model. In this paper, we propose to leverage the type information in the graph for code representation. To be specific, we propose the heterogeneous program graph (HPG), which provides the types of the nodes and the edges explicitly. Furthermore, we employ the heterogeneous graph transformer (HGT) architecture to generate representations based on HPG, taking the type information into account during processing. With the additional types in HPG, our approach can capture complex structural information, produce accurate, fine-grained representations, and ultimately perform well on certain tasks. Our in-depth evaluations on four classic datasets for two typical tasks (i.e., method name prediction and code classification) demonstrate that the heterogeneous types in HPG benefit the representation models. Our proposed HPG+HGT also outperforms the SOTA baselines on the subject tasks and datasets.
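
To make the idea of explicit node and edge types concrete, here is a minimal sketch. It is not the authors' HPG construction, whose details the abstract does not give. The sketch assumes Python's standard `ast` module as the parser, takes each AST class name as the node type, and takes the AST field name linking parent to child as the edge type. The function name `build_typed_graph` is hypothetical.

```python
import ast


def build_typed_graph(source: str):
    """Build a toy typed graph from a Python AST (illustrative assumption only).

    Returns:
        nodes: list of node-type strings, indexed by node id
        edges: list of (src_id, edge_type, dst_id) triples
    """
    tree = ast.parse(source)
    nodes = []
    edges = []

    def visit(node: ast.AST) -> int:
        # Assumption: the node type is the AST class name, e.g. "BinOp".
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    child_id = visit(child)
                    # Assumption: the edge type is the parent's field name,
                    # e.g. "left", "op", or "right" for a BinOp.
                    edges.append((node_id, field, child_id))
        return node_id

    visit(tree)
    return nodes, edges


nodes, edges = build_typed_graph("def add(a, b):\n    return a + b\n")

# Show each edge as (source node type, edge type, target node type).
for src, edge_type, dst in edges:
    print(f"{nodes[src]} --{edge_type}--> {nodes[dst]}")
```

A homogeneous graph would keep only the `(src, dst)` pairs. The typed triples keep the distinction between, say, the `left` and `right` operands of a `BinOp`, which is the kind of information a type-aware model such as HGT can condition on.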
