On the use of statistical machine translation for suggesting variable names for decompiled code: The Pharo case

IF 1.8 | Zone 3 (Computer Science) | JCR Q3 (Computer Science, Software Engineering) | Journal of Computer Languages | Pub Date: 2024-04-19 | DOI: 10.1016/j.cola.2024.101271

Juan Pablo Sandoval Alcocer, Harold Camacho-Jaimes, Geraldine Galindo-Gutierrez, Andrés Neyem, Alexandre Bergel, Stéphane Ducasse

Journal of Computer Languages, volume 79, Article 101271. Not open access. https://www.sciencedirect.com/science/article/pii/S2590118424000145

Citations: 0

Abstract

Adequately selecting variable names is a difficult activity for practitioners. In 2018, Jaffe et al. proposed the use of statistical machine translation (SMT) to suggest descriptive variable names for decompiled code. A large corpus of decompiled C code was used to train the SMT model. Our paper presents the results of a partial replication of Jaffe’s experiment. We apply the same technique and methodology to a dataset made of code written in the Pharo programming language. We selected Pharo since its syntax is simple – it fits on half of a postcard – and because the optimizations performed by the compiler are limited to method scope. Our results indicate that SMT may recover between 8.9% and 69.88% of the variable names depending on the training set. Our replication concludes that: (i) the accuracy depends on the code similarity between the training and testing sets; (ii) the simplicity of the Pharo syntax and the satisfactory decompiled code alignment have a positive impact on predicting variable names; and (iii) a relatively small code corpus is sufficient to train the SMT model, which shows the applicability of the approach to less popular programming languages. Additionally, to assess SMT’s potential in improving original variable names, ten Pharo developers reviewed 400 SMT name suggestions, with four reviews per variable. Only 15 suggestions (3.75%) were unanimously viewed as improvements, while 45 (11.25%) were perceived as improvements by at least two reviewers, highlighting SMT’s limitations in providing suitable alternatives.
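The review figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts come from the abstract, while the function name `share_percent`, the reading of "four reviews per variable" as four reviews for each of the 400 suggestions, and the even split of reviews across the ten developers are assumptions made here, not details confirmed by the paper.

```python
from fractions import Fraction


def share_percent(count: int, total: int) -> float:
    """Return count/total expressed as a percentage."""
    return float(Fraction(count, total) * 100)


# Figures taken from the abstract.
TOTAL_SUGGESTIONS = 400
REVIEWS_PER_SUGGESTION = 4  # assumed reading of "four reviews per variable"
REVIEWERS = 10

unanimous = share_percent(15, TOTAL_SUGGESTIONS)
at_least_two = share_percent(45, TOTAL_SUGGESTIONS)

# Total number of individual reviews under the assumed reading.
total_reviews = TOTAL_SUGGESTIONS * REVIEWS_PER_SUGGESTION

# Assumption: reviews were spread evenly over the reviewers.
reviews_per_developer = total_reviews / REVIEWERS

print(f"Unanimous improvements: {unanimous}%")
print(f"Improvements per at least two reviewers: {at_least_two}%")
print(f"Total reviews: {total_reviews}")
print(f"Average reviews per developer: {reviews_per_developer}")
```

Both percentages match the values reported in the abstract (3.75% and 11.25%). The per-developer workload is only an average under the even-split assumption.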


