Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code (T)

2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). Pub Date: 2015-11-09. DOI: 10.1109/ASE.2015.74

A. Nguyen, T. Nguyen, T. Nguyen

Pages: 585-596

Citations: 58

Abstract

Prior research shows that directly applying phrase-based statistical machine translation (SMT) to lexical tokens when migrating Java to C# produces a large amount of semantically incorrect code. A key limitation is that phrase-based SMT models and translates source code, which has well-formed structure, as flat sequences. We propose mppSMT, a divide-and-conquer technique that addresses this limitation with novel training and migration algorithms that apply phrase-based SMT in three phases. First, mppSMT treats a program as a sequence of syntactic units and maps such sequences in the two languages to one another. Second, in a syntax-directed fashion, it handles the tokens within each syntactic unit by encoding them with semantic symbols (sememes) that represent their data types and token types. This encoding improves the migration of API usages. Third, the lexical tokens corresponding to each sememe are mapped or migrated. The resulting token sequences are merged to form the final migrated code. These divide-and-conquer and syntax-directed strategies let phrase-based SMT adapt to the syntactic structure of source code, thus improving migration accuracy. Our empirical evaluation on several real-world systems shows that 84.8% to 97.9% of the migrated methods are syntactically correct and 70% to 83% are semantically correct. Between 26.3% and 51.2% of all migrated methods exactly match the human-written C# code in the oracle. Compared with Java2CSharp, a rule-based migration tool, mppSMT achieves 6.6% to 57.7% higher semantic accuracy in relative terms. Importantly, it requires neither manual labeling of training data nor manual definition of rules.
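The three phases described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: plain dictionary lookups stand in for the trained phrase-based SMT models, and every name below (`UNIT_TABLE`, `SYMBOL_TABLE`, `encode`, `decode`, `migrate`, the `T:`/`API:`/`ID`/`LIT` symbol scheme) is hypothetical and chosen only for illustration. It also assumes, as a simplification, that identifiers and literals keep their order within a unit.

```python
from typing import List, Tuple

Unit = Tuple[str, List[str]]  # (syntactic unit kind, Java lexical tokens)

# Phase 1 stand-in: maps a sequence of Java unit kinds to C# unit kinds.
UNIT_TABLE = {("VAR_DECL", "CALL_STMT"): ("VAR_DECL", "CALL_STMT")}

# Phase 2 stand-in: maps a unit's semantic-symbol sequence to the C# one.
SYMBOL_TABLE = {
    ("T:String", "ID", "=", "LIT", ";"):
        ("T:string", "ID", "=", "LIT", ";"),
    ("API:System.out.println", "(", "ID", ")", ";"):
        ("API:Console.WriteLine", "(", "ID", ")", ";"),
}

JAVA_TYPES = {"String"}
JAVA_APIS = {"System.out.println"}


def encode(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Replace tokens with semantic symbols; remember names and literals."""
    symbols, lexemes = [], []
    for tok in tokens:
        if tok in JAVA_TYPES:
            symbols.append("T:" + tok)
        elif tok in JAVA_APIS:
            symbols.append("API:" + tok)
        elif tok.startswith('"'):
            symbols.append("LIT")
            lexemes.append(tok)
        elif tok.isidentifier():
            symbols.append("ID")
            lexemes.append(tok)
        else:
            symbols.append(tok)  # punctuation and operators stay as-is
    return symbols, lexemes


def decode(symbols: Tuple[str, ...], lexemes: List[str]) -> List[str]:
    """Phase 3 stand-in: turn target symbols back into concrete C# tokens."""
    pending = iter(lexemes)
    out = []
    for sym in symbols:
        if sym in ("ID", "LIT"):
            out.append(next(pending))  # assumes order is preserved
        elif sym.startswith(("T:", "API:")):
            out.append(sym.split(":", 1)[1])
        else:
            out.append(sym)
    return out


def migrate(units: List[Unit]) -> str:
    kinds = tuple(kind for kind, _ in units)
    target_kinds = UNIT_TABLE[kinds]  # phase 1: unit-level mapping
    migrated: List[str] = []
    for (_, tokens), _target_kind in zip(units, target_kinds):
        symbols, lexemes = encode(tokens)
        target_symbols = SYMBOL_TABLE[tuple(symbols)]  # phase 2
        migrated.extend(decode(target_symbols, lexemes))  # phase 3
    return " ".join(migrated)  # merge per-unit results


java_units = [
    ("VAR_DECL", ["String", "msg", "=", '"hi"', ";"]),
    ("CALL_STMT", ["System.out.println", "(", "msg", ")", ";"]),
]
print(migrate(java_units))
```

In this sketch the type and API names travel through the symbol layer (`T:String` to `T:string`, `API:System.out.println` to `API:Console.WriteLine`), while identifiers and literals are abstracted away and restored afterwards. That separation is the reason the abstract gives for why semantic symbols help API migration; the real system learns these mappings statistically from parallel code rather than reading them from hand-written tables.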
