Reengineering an industrial HMI: Approach, objectives, and challenges
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 547-551
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330257
B. Dorninger, M. Moser, Albin Kern
Human Machine Interfaces (HMI) play a pivotal role in operating industrial machines. Depending on the extent of a manufacturer's domain and the range of its machines, as well as the possible options and variants, the ensuing HMI component repository may become substantially large, resulting in significant maintenance requirements and subsequent cost. A combination of cost pressure and other factors, such as significantly changed requirements, may then call for substantial reengineering. A viable alternative to manually reengineering the whole HMI framework is the use of (semi-)automated reengineering techniques for suitable parts. We describe such a model-based reengineering procedure, relying on static analysis of the existing source code, for suitable aspects of a large HMI framework. We sketch our overall approach and its objectives, and highlight important challenges in transforming HMI component information extracted from source code into a representation developed for the completely redesigned HMI infrastructure, in the light of an existing product assembly and configuration process at a large machinery manufacturer.
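To make the extraction step concrete, here is a minimal, hypothetical Python sketch of mining component information from HMI source code and emitting it as a model for a redesigned infrastructure. The declaration pattern, the component types, and the JSON schema are all assumptions for illustration; the paper's actual analysis and target representation are specific to the manufacturer's framework.

```python
# Hypothetical sketch: statically scan HMI source files for component
# declarations and emit a simple JSON component model. The declaration
# pattern and schema below are invented for illustration.
import json
import re
from pathlib import Path

# Assumed pattern for a declaration such as:  Button btnStart;
DECLARATION = re.compile(r"^\s*(?P<type>Button|Label|Gauge)\s+(?P<name>\w+)\s*;")

def extract_components(src_dir: str) -> list[dict]:
    components = []
    for path in Path(src_dir).rglob("*.cpp"):
        for line in path.read_text(errors="ignore").splitlines():
            match = DECLARATION.match(line)
            if match:
                components.append({"type": match["type"],
                                   "name": match["name"],
                                   "file": path.name})
    return components

if __name__ == "__main__":
    # The target representation is assumed to be a JSON component model.
    print(json.dumps(extract_components("hmi_src"), indent=2))
```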
FINALIsT2: Feature identification, localization, and tracing tool
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 532-537
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330254
Andreas Burger, Sten Grüner
Feature identification and localization is a complicated and error-prone task. Today it is mostly done manually by lead software developers or domain experts. Sometimes these experts are no longer available or cannot support the feature identification and localization process. We therefore propose a tool that supports this process with an iterative, semi-automatic workflow for identifying, localizing, and documenting features. Our tool calculates a feature cluster based on a defined entry point, which is found using information retrieval techniques. This feature cluster is then iteratively refined by the user. The iterative, feedback-driven workflow enables developers who are not deeply involved in the development of the software to identify and extract features properly. We evaluated our tool on an industrial smart control system for electric motors, with promising first results.
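A minimal sketch of the two core steps, under stated assumptions: TF-IDF retrieval over method bodies to locate the entry point, and expansion over a call graph with user feedback to refine the cluster. The function names and data structures are illustrative, not FINALIsT2's actual API.

```python
# Sketch of IR-based entry-point search plus iterative cluster refinement.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_entry_point(query: str, methods: dict[str, str]) -> str:
    """Rank method bodies against a feature description; the best hit is the entry point."""
    names = list(methods)
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([methods[n] for n in names] + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return names[scores.argmax()]

def refine_cluster(entry: str, call_graph: dict[str, set[str]],
                   accepted: set[str], rejected: set[str]) -> set[str]:
    """One refinement round: expand from accepted methods, drop rejected ones."""
    cluster = {entry} | accepted
    for method in list(cluster):
        cluster |= call_graph.get(method, set())
    return cluster - rejected

methods = {"startMotor": "start motor speed ramp control",
           "logError": "write error message to log"}
graph = {"startMotor": {"logError"}}
entry = find_entry_point("motor speed control feature", methods)
print(entry, refine_cluster(entry, graph, accepted=set(), rejected=set()))
```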
The impact of refactoring changes on the SZZ algorithm: An empirical study
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 380-390
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330225
Edmilson Campos Neto, D. A. D. Costa, U. Kulesza
SZZ is a widely used algorithm in the software engineering community to identify changes that are likely to introduce bugs (i.e., bug-introducing changes). Despite its wide adoption, SZZ still has room for improvement. For example, current SZZ implementations may still flag refactoring changes as bug-introducing. Refactorings should be disregarded as bug-introducing because they do not change the system behaviour. In this paper, we empirically investigate how refactorings impact both the input (bug-fix changes) and the output (bug-introducing changes) of the SZZ algorithm. We analyse 31,518 issues from ten Apache projects with 20,298 bug-introducing changes, using an existing tool that automatically detects refactorings in code changes. We observe that 6.5% of the lines flagged as bug-introducing changes by SZZ are in fact refactoring changes. Regarding bug-fix changes, we observe that 19.9% of the lines removed during a fix are related to refactorings and, therefore, their respective inducing changes are false positives. We then incorporate the refactoring-detection tool in our Refactoring-Aware SZZ Implementation (RA-SZZ). Our results reveal that RA-SZZ reduces the lines flagged as bug-introducing changes by 20.8% compared to state-of-the-art SZZ implementations. Finally, we perform a manual analysis to identify change patterns that are not captured by the refactoring identification tool used in our study. Our results reveal that 47.95% of the analyzed bug-introducing changes contain additional change patterns that RA-SZZ should not flag as bug-introducing.
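Conceptually, the refactoring-aware step subtracts detector-flagged refactoring lines from SZZ's output. The sketch below illustrates this under the assumption that both tools report (file, line) pairs; the real RA-SZZ integrates the detector into SZZ's line-mapping phase rather than post-filtering.

```python
# Conceptual sketch of refactoring-aware filtering of SZZ results.
def refactoring_aware_szz(flagged_lines: set, refactored_lines: set) -> set:
    """
    flagged_lines: (file, line_no) pairs flagged as bug-introducing by SZZ.
    refactored_lines: (file, line_no) pairs a detector attributes to
    behavior-preserving refactorings.
    Returns the flagged lines that survive the refactoring filter.
    """
    return flagged_lines - refactored_lines

# Example: two of three flagged lines are really an extract-method refactoring.
szz_output = {("Foo.java", 10), ("Foo.java", 11), ("Bar.java", 42)}
refactorings = {("Foo.java", 10), ("Foo.java", 11)}
assert refactoring_aware_szz(szz_output, refactorings) == {("Bar.java", 42)}
```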
Using recurrent neural networks for decompilation
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 346-356
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330222
Deborah S. Katz, Jason Ruchti, Eric Schulte
Decompilation, recovering source code from a binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. Source code is much easier for humans to read than binary code, and there are many tools available to analyze source code. Existing decompilation techniques often generate source code that is difficult for humans to understand because the generated code often does not use the coding idioms that programmers use. Differences from human-written code also reduce the effectiveness of analysis tools on the decompiled source code. To address the problem of differences between decompiled code and human-written code, we present a novel technique for decompiling binary code snippets using a model based on Recurrent Neural Networks. The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language and its properties or of how a compiler operates, making the approach easily extensible to new languages and constructs. Furthermore, the technique can be extended and applied in situations that traditional decompilers do not target, such as decompilation of isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing for readability of the decompilation; and recovering control flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close to the original and can provide a useful picture of the snippet's behavior.
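As a rough illustration of the kind of model involved, the following PyTorch sketch pairs a byte-level encoder with a token-level decoder trained on (machine-code bytes, source tokens) pairs. The architecture, vocabulary sizes, and tokenization are assumptions; the paper's exact network may differ.

```python
# Sequence-to-sequence sketch: machine-code bytes in, source tokens out.
import torch
import torch.nn as nn

class ByteToSourceRNN(nn.Module):
    def __init__(self, n_byte_vocab=256, n_src_vocab=5000, hidden=256):
        super().__init__()
        self.byte_embed = nn.Embedding(n_byte_vocab, hidden)
        self.src_embed = nn.Embedding(n_src_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_src_vocab)

    def forward(self, byte_seq, src_seq):
        # Encode the raw instruction bytes of the snippet.
        _, state = self.encoder(self.byte_embed(byte_seq))
        # Decode source tokens conditioned on the snippet (teacher forcing).
        dec_out, _ = self.decoder(self.src_embed(src_seq), state)
        return self.out(dec_out)

# One training step on a toy pair: 16 machine-code bytes -> 8 source tokens.
model = ByteToSourceRNN()
bytes_in = torch.randint(0, 256, (1, 16))
tokens = torch.randint(0, 5000, (1, 8))
logits = model(bytes_in, tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
```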
GoldRusher: A miner for rapid identification of hidden code
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 517-521
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330251
Aleieldin Salem
GoldRusher is a dynamic analysis tool primarily meant to aid reverse engineers in analyzing malware. Based on the fact that hidden code segments rarely execute, the tool is able to rapidly highlight functions and basic blocks that are potentially hidden, and to identify the trigger conditions that control their execution.
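The core heuristic can be illustrated with a small sketch, assuming per-basic-block execution counts collected over several traced runs: blocks that never execute in any run are candidates for hidden code, and the branch conditions guarding them are candidate triggers. The data structures are illustrative, not the tool's format.

```python
# Flag basic blocks that never execute across all observed runs.
def hidden_block_candidates(coverage_runs: list[dict[str, int]]) -> set[str]:
    """Each dict maps basic-block ids to execution counts for one run."""
    all_blocks = set().union(*coverage_runs)
    executed = {block for run in coverage_runs
                for block, count in run.items() if count > 0}
    return all_blocks - executed

runs = [{"bb_1": 12, "bb_2": 0, "bb_3": 3},
        {"bb_1": 9, "bb_2": 0, "bb_3": 0}]
print(hidden_block_candidates(runs))  # {'bb_2'} - never executed in any run
```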
LICCA: A tool for cross-language clone detection
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 512-516
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330250
Tijana Vislavski, Gordana Rakic, Nicolás Cardozo, Z. Budimac
Code clones have largely been proven harmful for the development and maintenance of software systems, leading to code deterioration and an increase in bugs as the system evolves. Modern software systems are composed of several components and incorporate multiple technologies in their development. In such systems, it is common to replicate (parts of) functionality across the different components, potentially in a different programming language. The effect of these duplicates is more acute, as their identification becomes more challenging. This paper presents LICCA, a tool for the identification of duplicate code fragments across multiple languages. LICCA is integrated with the SSQSA platform and relies on its high-level representation of code, from which it is possible to extract syntactic and semantic characteristics of code fragments, enabling full cross-language clone detection. LICCA is at the technology-development stage. We demonstrate its potential by adopting a set of cloning scenarios, extended and rewritten in five characteristic languages: Java, C, JavaScript, Modula-2, and Scheme.
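The intuition behind cross-language detection can be sketched as structural comparison of language-agnostic trees. The Node type below is a stand-in for SSQSA's high-level code representation, which LICCA actually uses; the node kinds are invented for illustration.

```python
# Compare fragments from different languages via a common, normalized tree.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    kind: str                      # language-agnostic node kind, e.g. "loop"
    children: tuple = field(default_factory=tuple)

def same_structure(a: Node, b: Node) -> bool:
    return (a.kind == b.kind
            and len(a.children) == len(b.children)
            and all(same_structure(x, y)
                    for x, y in zip(a.children, b.children)))

# A counting loop written in Java and in Scheme can normalize to the same tree.
java_loop = Node("loop", (Node("assign"), Node("call")))
scheme_loop = Node("loop", (Node("assign"), Node("call")))
print(same_structure(java_loop, scheme_loop))  # True -> reported as a clone pair
```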
Fuzz testing in practice: Obstacles and solutions
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 562-566
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330260
Jie Liang, Mingzhe Wang, Yuanliang Chen, Yu Jiang, Renwei Zhang
Fuzz testing has helped security researchers and organizations discover a large number of vulnerabilities. Although it is efficient and widely used in industry, hardly any empirical studies or experience reports exist on the customization of fuzzers to real industrial projects. In this paper, collaborating with engineers from Huawei, we present the practice of adapting fuzz testing to a proprietary message middleware named libmsg, which is responsible for message transfer in the entire distributed system department. We present the main obstacles encountered in applying an efficient fuzzer to libmsg, including system configuration inconsistency, system build complexity, and the absence of fuzzing drivers, and we provide solutions for these typical obstacles. For example, for the most difficult and expensive obstacle, writing fuzzing drivers, we present a low-cost approach that converts existing sample code snippets into fuzzing drivers. After overcoming those obstacles, we were able to effectively identify software bugs and reported 9 previously unknown vulnerabilities, including flaws that lead to denial of service or system crashes.
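For illustration, a snippet-to-driver conversion might look like the following sketch, written with the Python fuzzer Atheris since libmsg is proprietary; msg_parse is a hypothetical stand-in for the middleware entry point that a sample snippet exercises, not an actual libmsg API.

```python
# Sketch: turning a sample code snippet into a fuzzing driver.
import sys
import atheris

def msg_parse(data: bytes) -> None:
    # Hypothetical stand-in for the middleware entry point called by the
    # sample snippet; raises on a crafted input to simulate a parsing flaw.
    if data.startswith(b"MSG") and len(data) > 8:
        raise ValueError("simulated parsing flaw")

def TestOneInput(data: bytes) -> None:
    # The sample snippet's fixed payload is replaced by fuzzer-generated
    # bytes; Atheris reports any uncaught exception or crash as a finding.
    msg_parse(data)

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```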
Mining framework usage graphs from app corpora
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 277-289
Pub Date: 2018-03-01, DOI: 10.1109/SANER.2018.8330216
Sergio Mover, S. Sankaranarayanan, Rhys Braginton Pettee Olsen, B. E. Chang
We investigate the problem of mining graph-based usage patterns for large, object-oriented frameworks like Android—revisiting previous approaches based on graph-based object usage models (groums). Groums are a promising approach to represent usage patterns for object-oriented libraries because they simultaneously describe control flow and data dependencies between methods of multiple interacting object types. However, this expressivity comes at a cost: mining groums requires solving a subgraph isomorphism problem that is well known to be expensive. This cost limits the applicability of groum mining to large API frameworks. In this paper, we employ groum mining to learn usage patterns for object-oriented frameworks from program corpora. The central challenge is to scale groum mining so that it is sensitive to usages horizontally across programs from arbitrarily many developers (as opposed to simply usages vertically within the program of a single developer). To address this challenge, we develop a novel groum mining algorithm that scales on a large corpus of programs. We first use frequent itemset mining to restrict the search for groums to smaller subsets of methods in the given corpus. Then, we pose the subgraph isomorphism as a SAT problem and apply efficient pre-processing algorithms to rule out fruitless comparisons ahead of time. Finally, we identify containment relationships between clusters of groums to characterize popular usage patterns in the corpus (as well as classify less popular patterns as possible anomalies). We find that our approach scales on a corpus of over five hundred open source Android applications, effectively mining obligatory and best-practice usage patterns.
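The first stage can be sketched as follows, assuming each program in the corpus is reduced to the set of framework methods it calls; a hand-rolled support count stands in for a real frequent itemset miner. Only method sets that co-occur frequently across the corpus are handed to the expensive SAT-based graph-mining stage.

```python
# Frequent itemset pre-filter over per-program method sets.
from collections import Counter
from itertools import combinations

def frequent_method_sets(programs: list[set[str]], size: int,
                         min_support: int) -> set[tuple[str, ...]]:
    """Count every `size`-subset of methods used together; keep frequent ones."""
    counts = Counter()
    for methods in programs:
        for combo in combinations(sorted(methods), size):
            counts[combo] += 1
    return {combo for combo, n in counts.items() if n >= min_support}

corpus = [{"Cursor.query", "Cursor.close", "Log.d"},
          {"Cursor.query", "Cursor.close"},
          {"Cursor.query", "Toast.show"}]
# Only (Cursor.close, Cursor.query) reaches support 2, so only that
# method set proceeds to groum mining.
print(frequent_method_sets(corpus, size=2, min_support=2))
```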
Duplicate question detection in stack overflow: A reproducibility study
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 572-581
Pub Date: 2018-02-21, DOI: 10.1109/SANER.2018.8330262
Rodrigo F. Silva, K. V. R. Paixão, M. Maia
Stack Overflow has become a fundamental element of the developer toolset. This growing influence has been accompanied by an effort from the Stack Overflow community to keep up the quality of its content. One of the problems that jeopardizes that quality is the continuous growth of duplicated questions. To solve this problem, prior work has focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, hindering subsequent works in the scientific literature that rely on them. We executed an empirical study as a reproduction of DupPredictor and Dupe. Our results, which were not robust when attempted with different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when applied to more recent data, both of our reproductions show a performance decay in terms of recall-rate over time, as the number of questions increases. Our findings suggest that subsequent work on the detection of duplicated questions in Question and Answer communities requires more investigation to assert its findings.
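The recall-rate metric tracked in the study can be stated in a few lines: it is the fraction of duplicate questions whose true master question appears among the detector's top-k candidates. The candidate lists and ids below are illustrative.

```python
# recall-rate@k for duplicate-question detection.
def recall_rate_at_k(results: dict[str, list[str]],
                     masters: dict[str, str], k: int) -> float:
    """results: duplicate id -> ranked candidate master ids.
    masters: duplicate id -> true master id."""
    hits = sum(1 for dup, ranked in results.items()
               if masters[dup] in ranked[:k])
    return hits / len(results)

ranked_candidates = {"q101": ["q7", "q3", "q9"], "q102": ["q4", "q8", "q1"]}
true_masters = {"q101": "q3", "q102": "q5"}
print(recall_rate_at_k(ranked_candidates, true_masters, k=3))  # 0.5
```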
Dissection of a bug dataset: Anatomy of 395 patches from Defects4J
2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 130-140
Pub Date: 2018-01-19, DOI: 10.1109/SANER.2018.8330203
Victor Sobreira, Thomas Durieux, Fernanda Madeiral Delfim, Monperrus Martin, M. Maia
Well-designed and publicly available datasets of bugs are an invaluable asset for advancing research fields such as fault localization and program repair, as they allow direct and fair comparison between competing techniques as well as the replication of experiments. These datasets need to be deeply understood by researchers: the answer to questions like "which bugs can my technique handle?" and "for which bugs is my technique effective?" depends on the comprehension of properties related to bugs and their patches. However, such properties are usually not included in the datasets, and there is still no widely adopted methodology for characterizing bugs and patches. In this work, we deeply study 395 patches of the Defects4J dataset. Quantitative properties (patch size and spreading) were automatically extracted, whereas qualitative ones (repair actions and patterns) were manually extracted using a thematic-analysis-based approach. We found that 1) the median size of Defects4J patches is four lines, and almost 30% of the patches contain only additions of lines; 2) 92% of the patches change only one file, and 38% have no spreading at all; 3) the top-3 most applied repair actions are addition of method calls, conditionals, and assignments, occurring in 77% of the patches; and 4) nine repair patterns were found for 95% of the patches, where the most prevalent, appearing in 43% of the patches, is on conditional blocks. These results are useful for researchers to perform advanced analyses of their techniques' results based on Defects4J. Moreover, our set of properties can be used to characterize and compare different bug datasets.
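The quantitative extraction can be approximated with a short sketch over unified diffs; note that spreading is simplified here to the number of hunks beyond the first, whereas the paper measures the lines separating chunks of a patch.

```python
# Extract patch size, files touched, and an approximate spreading measure
# from a unified diff.
def patch_properties(diff_text: str) -> dict:
    size = files = hunks = 0
    for line in diff_text.splitlines():
        if line.startswith("+++"):
            files += 1
        elif line.startswith("@@"):
            hunks += 1
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            size += 1  # an added or removed line of the patch
    return {"size": size, "files": files, "spreading": max(hunks - 1, 0)}

example_diff = """\
--- a/Foo.java
+++ b/Foo.java
@@ -10,2 +10,3 @@
-    if (x > 0)
+    if (x >= 0)
+        log(x);
"""
print(patch_properties(example_diff))  # {'size': 3, 'files': 1, 'spreading': 0}
```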