Hierarchical Learning of Cross-Language Mappings Through Distributed Vector Representations for Code

2018 IEEE/ACM 40th International Conference on Software Engineering: New Ideas and Emerging Technologies Results (ICSE-NIER). Pub Date: 2018-03-13. DOI: 10.1145/3183399.3183427

Nghi D. Q. Bui, Lingxiao Jiang


Citations: 10

Abstract

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is twofold: first, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings) based on word2vec, a neural-network-based technique for producing word embeddings; second, hierarchically from the bottom up, we construct shared embeddings for code elements at higher levels of granularity (e.g., expressions, statements, methods) from the embeddings of their constituents, and then build mappings among code elements across languages based on similarities among embeddings. Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at https://github.com/bdqnghi/hierarchical-programming-language-mapping. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.
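The bottom-up construction and similarity-based mapping described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how constituent embeddings are combined, so the averaging rule here is an assumption, and the token vectors, the names `TOKEN_VECS`, `compose`, `cosine`, and `best_match`, and the example statements are all hypothetical.

```python
import math

# Toy "shared" token embeddings for Java and C# tokens.
# Hypothetical values for illustration; real ones would be trained (e.g., with word2vec).
TOKEN_VECS = {
    "System.out.println": [0.90, 0.10, 0.00],
    "Console.WriteLine":  [0.88, 0.12, 0.02],
    "String":             [0.10, 0.90, 0.00],
    "string":             [0.12, 0.88, 0.01],
    "Length":             [0.02, 0.18, 0.91],
}


def compose(tokens):
    """Embed a higher-level code element as the mean of its constituents' vectors.

    Averaging is an assumed composition rule, not one confirmed by the abstract.
    """
    vecs = [TOKEN_VECS[t] for t in tokens]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(source_tokens, candidates):
    """Return (name, similarity) of the candidate element closest to the source element."""
    source_vec = compose(source_tokens)
    scored = [(name, cosine(source_vec, compose(tokens)))
              for name, tokens in candidates.items()]
    return max(scored, key=lambda pair: pair[1])


# A Java statement's tokens and two candidate C# statements.
java_stmt = ["System.out.println", "String"]
csharp_stmts = {
    "print": ["Console.WriteLine", "string"],
    "len":   ["string", "Length"],
}
match_name, match_score = best_match(java_stmt, csharp_stmts)
print(match_name, round(match_score, 3))
```

With these toy vectors, the Java statement maps to the C# `print` candidate because their composed embeddings point in nearly the same direction. Ranking all candidates by this score, rather than taking only the top one, would give the ranked lists that a metric such as Mean Average Precision evaluates.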


