Mining Idioms in the Wild

2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). Pub Date: 2021-07-13. DOI: 10.1145/3510457.3513046

Aishwarya Sivaraman, Rui Abreu, Andrew C. Scott, Tobi Akomolede, S. Chandra


Citations: 3

Abstract

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because developers may be unaware of the new APIs or have not gotten around to adopting them. Detection of idiomatic code can also point to the need for new APIs. We share our experiences in mining imperative idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper, Jezero, is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight, static-analysis-based approach to construct such a tree, one that is well suited to mining semantic idioms using nonparametric Bayesian methods. Our experiments with Jezero on Hack code show a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective in finding new refactoring opportunities from unannotated legacy code than a baseline that did not have the dataflow augmentation.
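To make the idea of canonicalization concrete, here is a minimal sketch in Python (not Hack, and not Jezero's actual algorithm, which the abstract does not spell out). It assumes a much simpler stand-in for a canonicalized dataflow tree: identifiers in a plain AST are renamed to `v0`, `v1`, ... in order of first appearance, so two loops that differ only in naming collapse to the same tree. The class name `CanonicalizeNames`, the helper `canonical_form`, and the sample snippets are all hypothetical.

```python
import ast


class CanonicalizeNames(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance.

    Illustrative assumption only: a crude stand-in for canonicalization,
    not the dataflow-based construction the paper proposes.
    """

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        # The default is computed before insertion, so it equals the
        # number of identifiers seen so far.
        new_id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)


def canonical_form(source: str) -> str:
    """Return a string dump of the AST after identifier renaming."""
    tree = CanonicalizeNames().visit(ast.parse(source))
    return ast.dump(tree)


# Two imperative "filter" loops that differ only in variable names.
SNIPPET_A = """
out = []
for x in items:
    if x > 0:
        out.append(x)
"""

SNIPPET_B = """
result = []
for elem in values:
    if elem > 0:
        result.append(elem)
"""

# A structurally different loop (a "map" rather than a "filter").
SNIPPET_C = """
out = []
for x in items:
    out.append(x * 2)
"""

if __name__ == "__main__":
    print(canonical_form(SNIPPET_A) == canonical_form(SNIPPET_B))
    print(canonical_form(SNIPPET_A) == canonical_form(SNIPPET_C))
```

Under this simplification, the two filter loops share one canonical tree while the map loop does not, which is the kind of grouping a pattern miner needs before it can count recurring idioms. Renaming alone says nothing about how values flow between statements; the abstract reports that adding canonicalized dataflow information is what gave Jezero its advantage over the AST-only baseline.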


