
Latest publications from the 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)

SourcererCC: Scaling Code Clone Detection to Big-Code
Pub Date : 2015-12-20 DOI: 10.1145/2884781.2884877
Hitesh Sajnani, V. Saini, Jeffrey Svajlenko, C. Roy, C. Lopes
Despite a decade of active research, there has been a marked lack in clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
Pages: 1157-1168
Citations: 456
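The abstract's core ideas (token-bag similarity, an inverted index, and a prefix filter that prunes candidate comparisons) can be sketched as follows. This is a toy illustration, not the authors' tool; the class and function names are invented for the sketch.

```python
from collections import Counter, defaultdict

def overlap_similarity(tokens_a, tokens_b):
    """Token-bag overlap: |A intersect B| / max(|A|, |B|)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    shared = sum((a & b).values())
    return shared / max(sum(a.values()), sum(b.values()))

class CloneIndex:
    """Toy inverted index over code-block tokens with prefix filtering."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.blocks = {}               # block id -> token list
        self.index = defaultdict(set)  # token -> ids of blocks indexed under it

    def _prefix(self, tokens):
        # Index only the first |B| - floor(t * |B|) + 1 tokens in a fixed
        # global order: two blocks that can reach the similarity threshold
        # must share at least one token in each other's prefix.
        k = len(tokens) - int(self.threshold * len(tokens)) + 1
        return sorted(set(tokens))[:k]

    def add(self, block_id, tokens):
        self.blocks[block_id] = tokens
        for tok in self._prefix(tokens):
            self.index[tok].add(block_id)

    def query(self, tokens):
        candidates = set()
        for tok in self._prefix(tokens):
            candidates |= self.index[tok]
        return sorted(b for b in candidates
                      if overlap_similarity(tokens, self.blocks[b]) >= self.threshold)
```

SourcererCC additionally orders tokens by global frequency and uses further position-based filters; the lexicographic order above is a simplification of the same pruning idea.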
SWIM: Synthesizing What I Mean - Code Search and Idiomatic Snippet Synthesis
Pub Date : 2015-11-26 DOI: 10.1145/2884781.2884808
Mukund Raghothaman, Yi Wei, Y. Hamadi
Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code". We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from open-source code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce "structured call sequences" to capture API-usage patterns. Structured call sequences are a generalized form of method call sequences, with if-branches and while-loops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis. We evaluated SWIM with 30 common C# API-related queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet.
Pages: 357-367
Citations: 152
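The clickthrough-to-API mapping step can be illustrated with a tiny sketch. The data and function names here are hypothetical; SWIM's actual pipeline learns ranked translations on top of such co-occurrence counts.

```python
from collections import Counter, defaultdict

def build_query_api_map(clickthrough):
    """Map query keywords to APIs from (query, clicked_api) pairs, the way
    clickthrough data can link natural language terms to API names."""
    keyword_apis = defaultdict(Counter)
    for query, api in clickthrough:
        for word in query.lower().split():
            keyword_apis[word][api] += 1
    return keyword_apis

def rank_apis(keyword_apis, query, k=3):
    """Score APIs by summing their click counts over the query's keywords."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(keyword_apis[word])
    return [api for api, _ in scores.most_common(k)]
```

A query such as "md5 hash" would then surface the API most often clicked for those keywords, which SWIM feeds into its snippet-synthesis stage.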
PAC Learning-Based Verification and Model Synthesis
Pub Date : 2015-11-03 DOI: 10.1145/2884781.2884860
Yu-Fang Chen, Chiao-En Hsieh, Ondřej Lengál, Tsung-Ju Lii, M. Tsai, Bow-Yaw Wang, Farn Wang
We introduce a novel technique for verification and model synthesis of sequential programs. Our technique is based on learning an approximate regular model of the set of feasible paths in a program, and testing whether this model contains an incorrect behavior. Exact learning algorithms require checking equivalence between the model and the program, which is a difficult problem, in general undecidable. Our learning procedure is therefore based on the framework of probably approximately correct (PAC) learning, which uses sampling instead, and provides correctness guarantees expressed using the terms error probability and confidence. Besides the verification result, our procedure also outputs the model with the said correctness guarantees. Obtained preliminary experiments show encouraging results, in some cases even outperforming mature software verifiers.
Pages: 714-724
Citations: 27
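The error-probability/confidence guarantees mentioned in the abstract come from standard PAC sample-complexity bounds. As a hedged illustration, here is the textbook bound for a finite hypothesis class, not necessarily the exact bound the paper derives:

```python
import math

def pac_sample_size(epsilon, delta, hypothesis_space_size=None):
    """Number of samples sufficient so that, with probability >= 1 - delta,
    a hypothesis consistent with all samples has error <= epsilon
    (standard finite-class PAC bound: n >= (ln|H| + ln(1/delta)) / epsilon)."""
    if hypothesis_space_size is None:
        # Degenerate case |H| = 1: testing a single fixed model.
        return math.ceil(math.log(1 / delta) / epsilon)
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon)
```

The practical point, which the paper exploits, is that the sample count depends only on epsilon and delta (and the model class), not on the size of the program's path space, so sampling replaces an undecidable equivalence check.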
Program Synthesis Using Natural Language
Pub Date : 2015-09-01 DOI: 10.1145/2884781.2884786
Aditya Desai, Sumit Gulwani, V. Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, R. Sailesh, Subhajit Roy
Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program synthesizers that take natural language (NL) inputs and produce expressions in a target DSL. The framework takes as input a DSL definition and training data consisting of NL/DSL pairs. From these it constructs a synthesizer by learning optimal weights and classifiers (using NLP features) that rank the outputs of a keyword-programming based translation. We applied our framework to three domains: repetitive text editing, an intelligent tutoring system, and flight information queries. On 1200+ English descriptions, the respective synthesizers rank the desired program as the top-1 and top-3 for 80% and 90% descriptions respectively.
Pages: 345-356
Citations: 117
Behavioral Log Analysis with Statistical Guarantees
Pub Date : 2015-08-30 DOI: 10.1145/2786805.2803198
Nimrod Busany, S. Maoz
Scalability is a major challenge for existing behavioral log analysis algorithms, which extract finite-state automaton models or temporal properties from logs generated by running systems. In this paper we present statistical log analysis, which addresses scalability using statistical tools. The key to our approach is to consider behavioral log analysis as a statistical experiment. Rather than analyzing the entire log, we suggest to analyze only a sample of traces from the log and, most importantly, provide means to compute statistical guarantees for the correctness of the analysis result. We present the theoretical foundations of our approach and describe two example applications, to the classic k-Tails algorithm and to the recently presented BEAR algorithm. Finally, based on experiments with logs generated from real-world models and with real-world logs provided to us by our industrial partners, we present extensive evidence for the need for scalable log analysis and for the effectiveness of statistical log analysis.
Pages: 877-887
Citations: 20
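The idea of trading exhaustive log analysis for a sampled analysis with a statistical guarantee can be sketched with a Hoeffding-style bound. This is illustrative only; the paper's guarantees for k-Tails and BEAR are more specific, and all names below are invented for the sketch.

```python
import math
import random

def required_sample(epsilon, delta):
    """Hoeffding bound: number of sampled traces so that the estimated
    frequency of a trace property is within epsilon of the true frequency
    with probability >= 1 - delta: n >= ln(2/delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_property(log, holds, epsilon=0.05, delta=0.05, rng=None):
    """Estimate the fraction of traces satisfying `holds` from a sample,
    instead of scanning the entire log."""
    rng = rng or random.Random(0)
    n = min(required_sample(epsilon, delta), len(log))
    sample = rng.sample(log, n)
    return sum(1 for trace in sample if holds(trace)) / n
```

Note that the required sample size is independent of the log's total size, which is exactly why sampling scales where exhaustive analysis does not.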
Efficient Large-Scale Trace Checking Using MapReduce
Pub Date : 2015-08-26 DOI: 10.1145/2884781.2884832
M. Bersani, D. Bianculli, C. Ghezzi, S. Krstic, P. S. Pietro
The problem of checking a logged event trace against a temporal logic specification arises in many practical cases. Unfortunately, known algorithms for an expressive logic like MTL (Metric Temporal Logic) do not scale with respect to two crucial dimensions: the length of the trace and the size of the time interval of the formula to be checked. The former issue can be addressed by distributed and parallel trace checking algorithms that can take advantage of modern cloud computing and programming frameworks like MapReduce. Still, the latter issue remains open with current state-of-the-art approaches. In this paper we address this memory scalability issue by proposing a new semantics for MTL, called lazy semantics. This semantics can evaluate temporal formulae and boolean combinations of temporal-only formulae at any arbitrary time instant. We prove that lazy semantics is more expressive than point-based semantics and that it can be used as a basis for a correct parametric decomposition of any MTL formula into an equivalent one with smaller, bounded time intervals. We use lazy semantics to extend our previous distributed trace checking algorithm for MTL. The evaluation shows that the proposed algorithm can check formulae with large intervals, on large traces, in a memory-efficient way.
Pages: 888-898
Citations: 18
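As background for what a trace checker evaluates, here is a minimal checker for a bounded-response MTL-style property G(p -> F_[0,b] q) over a finite discrete trace. The paper's contribution, lazy semantics and a MapReduce decomposition of large time intervals, goes well beyond this sketch.

```python
def check_response(trace, p, q, bound):
    """Check G(p -> F_[0,bound] q) on a finite trace: every position where
    p holds must be followed, within `bound` steps (including the position
    itself), by a position where q holds."""
    for i, state in enumerate(trace):
        if p(state):
            window = trace[i:i + bound + 1]
            if not any(q(s) for s in window):
                return False
    return True
```

The memory problem the paper targets shows up when `bound` is large relative to the trace: naive checking must keep the whole window live, which is what the proposed interval decomposition avoids.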
Learning API Usages from Bytecode: A Statistical Approach
Pub Date : 2015-07-27 DOI: 10.1145/2884781.2884873
Tam The Nguyen, H. Pham, P. Vu, T. Nguyen
Mobile app developers rely heavily on standard API frameworks and libraries. However, learning API usages is often challenging due to the fast-changing nature of API frameworks for mobile systems and the insufficiency of API documentation and source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include HAPI, a statistical model of API usages and three algorithms to extract method call sequences from apps’ bytecode, to train HAPI based on those sequences, and to recommend method calls in code completion using the trained HAPIs. Our empirical evaluation shows that our prototype tool can effectively learn API usages from 200 thousand apps containing 350 million method sequences. It recommends next method calls with top-3 accuracy of 90% and outperforms baseline approaches on average 10-20%.
Pages: 416-427
Citations: 68
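The flavor of statistical next-call recommendation can be shown with a much simpler stand-in: a bigram model over method-call sequences. This toy is not HAPI itself, and the class name is invented for the sketch.

```python
from collections import Counter, defaultdict

class CallBigramModel:
    """Toy bigram model over method-call sequences: recommends the calls
    that most often follow the current call in the training corpus."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.next_counts[prev][nxt] += 1

    def recommend(self, current, k=3):
        # Top-k candidates for code completion after `current`.
        return [call for call, _ in self.next_counts[current].most_common(k)]
```

Trained on call sequences mined from bytecode, such a model supports the top-k completion setting the abstract evaluates (top-3 accuracy), though HAPI's statistical machinery is richer than plain bigrams.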
On the "Naturalness" of Buggy Code
Pub Date : 2015-06-03 DOI: 10.1145/2884781.2884848
Baishakhi Ray, V. Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, Premkumar T. Devanbu
Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be “natural”, like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is “unnatural” in some sense, and thus possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 7,139), from 10 different Java projects, and focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic (i.e. unnatural), becoming less so as bugs are fixed. Ordering files for inspection by their average entropy yields cost-effectiveness scores comparable to popular defect prediction methods. At a finer granularity, focusing on highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings.
Pages: 428-439
Citations: 204
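The entropy ranking described above can be illustrated with a toy unigram language model: lines whose tokens are rare under a model trained on the project score higher entropy and get flagged first. The paper uses far stronger cache n-gram models; the functions here are invented for the sketch.

```python
import math
from collections import Counter

def train_unigram(corpus_lines):
    """Return a smoothed unigram probability function over whitespace tokens."""
    counts = Counter(tok for line in corpus_lines for tok in line.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    # Laplace smoothing so unseen tokens get non-zero probability.
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def line_entropy(line, prob):
    """Average negative log2-probability (bits per token) of a line."""
    toks = line.split()
    return -sum(math.log2(prob(t)) for t in toks) / len(toks)
```

Ranking a file's lines by `line_entropy` in descending order gives the "inspect the most surprising code first" policy whose cost-effectiveness the paper compares against static bug finders.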
Work Practices and Challenges in Pull-Based Development: The Contributor's Perspective
Pub Date : 2015-05-16 DOI: 10.1145/2884781.2884826
Georgios Gousios, M. Storey, Alberto Bacchelli
The pull-based development model is an emerging way of contributing to distributed software projects that is gaining enormous popularity within the open source software (OSS) world. Previous work has examined this model by focusing on projects and their owners—we complement it by examining the work practices of project contributors and the challenges they face. We conducted a survey with 645 top contributors to active OSS projects using the pull-based model on GitHub, the prevalent social coding site. We also analyzed traces extracted from corresponding GitHub repositories. Our research shows that: contributors have a strong interest in maintaining awareness of project status to get inspiration and avoid duplicating work, but they do not actively propagate information; communication within pull requests is reportedly limited to low-level concerns and contributors often use communication channels external to pull requests; challenges are mostly social in nature, with most reporting poor responsiveness from integrators; and the increased transparency of this setting is a confirmed motivation to contribute. Based on these findings, we present recommendations for practitioners to streamline the contribution process and discuss potential future research directions.
Pages: 285-296
Citations: 308
Journal: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)