Towards a framework for analysis, transformation, and manipulation of Makefiles
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081890
Douglas H. Martin
Build systems are an integral part of the software development process, responsible for turning source code into a deliverable product. They can, however, be difficult to comprehend and maintain. Make, the most popular build language, is often cited as being difficult to debug. In this work, we propose a framework to analyze and manipulate Makefiles and to discover how the language is used in open source systems, using existing software analysis techniques such as source transformation and clone detection.
{"title":"Towards a framework for analysis, transformation, and manipulation of Makefiles","authors":"Douglas H. Martin","doi":"10.1109/SANER.2015.7081890","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081890","url":null,"abstract":"Build systems are an integral part of the software development process, being responsible for turning source code into a deliverable product. They are, however, difficult to comprehend and maintain at times. Make, the most popular build language, is often cited as being difficult to debug. In this work, we propose a framework to analyze and manipulate Makefiles, and discover how the language is used in open source systems using existing software analysis techniques like source transformation and clone detection.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131849992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An empirical study of work fragmentation in software evolution tasks
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081835
Heider Sanchez, R. Robbes, Víctor M. González
Information workers and software developers are exposed to work fragmentation, an interleaving of activities and interruptions during their normal work day. Small-scale observational studies have shown that this can be detrimental to their work. In this paper, we perform a large-scale study of this phenomenon for the particular case of software developers performing software evolution tasks. Our study is based on several thousand interaction traces collected by Mylyn for dozens of developers. We observe that work fragmentation is correlated with lower observed productivity at both the macro level (for entire sessions) and the micro level (around markers of work fragmentation); further, longer activity switches seem to strengthen the effect. These observations form the basis for subsequent studies investigating the phenomenon of work fragmentation.
{"title":"An empirical study of work fragmentation in software evolution tasks","authors":"Heider Sanchez, R. Robbes, Víctor M. González","doi":"10.1109/SANER.2015.7081835","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081835","url":null,"abstract":"Information workers and software developers are exposed to work fragmentation, an interleaving of activities and interruptions during their normal work day. Small-scale observational studies have shown that this can be detrimental to their work. In this paper, we perform a large-scale study of this phenomenon for the particular case of software developers performing software evolution tasks. Our study is based on several thousands interaction traces collected by Mylyn, for dozens of developers. We observe that work fragmentation is correlated to lower observed productivity at both the macro level (for entire sessions), and at the micro level (around markers of work fragmentation); further, longer activity switches seem to strengthen the effect. These observations are basis for subsequent studies investigating the phenomenon of work fragmentation.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132244699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards incremental model slicing for delta-oriented software product lines
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081871
Sascha Lity, H. Baller, Ina Schaefer
Analyzing today's software systems to support, e.g., testing, verification, or debugging is becoming more challenging due to their increasing complexity. Model slicing is a promising analysis technique to tackle this issue by abstracting from those parts not influencing the current point of interest. In the context of software product lines, applying model slicing separately for each variant is in general infeasible. Delta modeling exploits the explicit specification of commonality and variability within deltas and enables the reuse of artifacts and previously obtained results, reducing modeling and analysis effort. In this paper, we propose a novel approach for incremental model slicing for delta-oriented software product lines. Based on the specification of model changes between variants by means of model regression deltas, an incremental adaptation of variant-specific dependency graphs as well as an incremental slice computation is achieved. The slice computation further allows for the derivation of differences between slices for the same point of interest, enhancing, e.g., change impact analysis. We provide details of our incremental approach, discuss its benefits, and present future work.
{"title":"Towards incremental model slicing for delta-oriented software product lines","authors":"Sascha Lity, H. Baller, Ina Schaefer","doi":"10.1109/SANER.2015.7081871","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081871","url":null,"abstract":"The analysis of nowadays software systems for supporting, e.g., testing, verification or debugging is becoming more challenging due to their increasing complexity. Model slicing is a promising analysis technique to tackle this issue by abstracting from those parts not influencing the current point of interest. In the context of software product lines, applying model slicing separately for each variant is in general infeasible. Delta modeling allows exploiting the explicit specification of commonality and variability within deltas and enables the reuse of artifacts and already obtained results to reduce the modeling and analysis efforts. In this paper, we propose a novel approach for incremental model slicing for delta-oriented software product lines. Based on the specification of model changes between variants by means of model regression deltas, an incremental adaptation of variant-specific dependency graphs as well as an incremental slice computation is achieved. The slice computation further allows for the derivation of differences between slices for the same point of interest enhancing, e.g., change impact analysis. We provide details of our incremental approach, discuss benefits and present future work.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124041659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Would static analysis tools help developers with code reviews?
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081826
Sebastiano Panichella, V. Arnaoudova, M. D. Penta, G. Antoniol
Code reviews have been conducted in software projects for decades, with the aim of improving code quality from many different points of view. During code reviews, developers are supported by checklists, coding standards and, possibly, by various kinds of static analysis tools. This paper investigates whether warnings highlighted by static analysis tools are taken care of during code reviews and whether some kinds of warnings tend to be removed more than others. Results of a study conducted by mining the Gerrit repository of six Java open source projects indicate that the density of warnings varies only slightly after each review. The overall percentage of warnings removed during reviews is slightly higher than what previous studies found for the overall project evolution history. However, when looking (quantitatively and qualitatively) at specific categories of warnings, we found that during code reviews developers focus on certain kinds of problems. For such categories of warnings the removal percentage tends to be very high, often above 50% and sometimes up to 100%. Examples are warnings in the imports, regular expressions, and type resolution categories. In conclusion, while broad warning detection might produce far too many false positives, enforcing the removal of certain warnings prior to patch submission could reduce the effort required during the code review process.
{"title":"Would static analysis tools help developers with code reviews?","authors":"Sebastiano Panichella, V. Arnaoudova, M. D. Penta, G. Antoniol","doi":"10.1109/SANER.2015.7081826","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081826","url":null,"abstract":"Code reviews have been conducted since decades in software projects, with the aim of improving code quality from many different points of view. During code reviews, developers are supported by checklists, coding standards and, possibly, by various kinds of static analysis tools. This paper investigates whether warnings highlighted by static analysis tools are taken care of during code reviews and, whether there are kinds of warnings that tend to be removed more than others. Results of a study conducted by mining the Gerrit repository of six Java open source projects indicate that the density of warnings only slightly vary after each review. The overall percentage of warnings removed during reviews is slightly higher than what previous studies found for the overall project evolution history. However, when looking (quantitatively and qualitatively) at specific categories of warnings, we found that during code reviews developers focus on certain kinds of problems. For such categories of warnings the removal percentage tend to be very high, often above 50% and sometimes up to 100%. Examples of those are warnings in the imports, regular expressions, and type resolution categories. In conclusion, while a broad warning detection might produce way too many false positives, enforcing the removal of certain warnings prior to the patch submission could reduce the amount of effort provided during the code review process.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130207386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated extraction of failure reproduction steps from user interaction traces
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081822
T. Roehm, Stefan Nosovic, B. Brügge
Bug reports submitted by users and crash reports collected by crash reporting tools often lack information about reproduction steps, i.e., the steps necessary to reproduce a failure. Hence, developers have difficulty reproducing field failures and might not be able to fix all reported bugs. We present an approach to automatically extract failure reproduction steps from user interaction traces. We capture interactions between a user and a WIMP GUI using a capture/replay tool. Then, we extract the minimal, failure-inducing subsequence of captured interaction traces. We use three algorithms to perform this extraction: Delta Debugging, Sequential Pattern Mining, and a combination of both. Delta Debugging automatically replays subsequences of an interaction trace to identify the minimal, failure-inducing subsequence. Sequential Pattern Mining identifies the common subsequence in interaction traces inducing the same failure. We evaluated our approach in a case study. We injected four bugs into the code of a mail client application, collected interaction traces of five participants trying to find these bugs, and applied the extraction algorithms. Delta Debugging extracted the minimal, failure-inducing interaction subsequence in 90% of all cases. Sequential Pattern Mining produced failure-inducing interaction sequences in 75% of all cases and removed on average 93% of unnecessary interactions, potentially enabling manual analysis by developers. Both algorithms complement each other because they are applicable in different contexts and can be combined to improve performance.
{"title":"Automated extraction of failure reproduction steps from user interaction traces","authors":"T. Roehm, Stefan Nosovic, B. Brügge","doi":"10.1109/SANER.2015.7081822","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081822","url":null,"abstract":"Bug reports submitted by users and crash reports collected by crash reporting tools often lack information about reproduction steps, i.e. the steps necessary to reproduce a failure. Hence, developers have difficulties to reproduce field failures and might not be able to fix all reported bugs. We present an approach to automatically extract failure reproduction steps from user interaction traces. We capture interactions between a user and a WIMP GUI using a capture/replay tool. Then, we extract the minimal, failure-inducing subsequence of captured interaction traces. We use three algorithms to perform this extraction: Delta Debugging, Sequential Pattern Mining, and a combination of both. Delta Debugging automatically replays subsequences of an interaction trace to identify the minimal, failure-inducing subsequence. Sequential Pattern Mining identifies the common subsequence in interaction traces inducing the same failure. We evaluated our approach in a case study. We injected four bugs to the code of a mail client application, collected interaction traces of five participants trying to find these bugs, and applied the extraction algorithms. Delta Debugging extracted the minimal, failure-inducing interaction subsequence in 90% of all cases. Sequential Pattern Mining produced failure-inducing interaction sequences in 75% of all cases and removed on average 93% of unnecessary interactions, potentially enabling manual analysis by developers. Both algorithms complement each other because they are applicable in different contexts and can be combined to improve performance.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129168247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A non-convex abstract domain for the value analysis of binaries
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081837
Sven Mattsen, Arne Wichmann, S. Schupp
A challenge in sound reverse engineering of binary executables is to determine sets of possible targets for dynamic jumps. One technique to address this challenge is abstract interpretation, where singleton values in registers and memory locations are overapproximated to collections of possible values. With contemporary abstract interpretation techniques, convexity is usually enforced on these collections, which causes unacceptable loss of precision. We present a non-convex abstract domain suitable for the analysis of binary executables. The domain is based on binary decision diagrams (BDDs) to allow an efficient representation of non-convex sets of integers. Non-convex sets are necessary to represent the results of jump table lookups and bitwise operations, which are more frequent in executables than in high-level code because of optimizing compilers. Our domain computes abstract bitwise and arithmetic operations precisely and loses precision only for division and multiplication. Because the operations are defined on the structure of the BDDs, they remain efficient even if executed on very large sets. In executables, conditional jumps require solving formulas built with negation and conjunction. We implement a constraint solver using the fast intersection and complementation of BDD-based sets. Our domain is implemented as a plug-in, called BDDStab, and integrated with the binary analysis framework Jakstab. We use Jakstab's k-set and interval domains to discuss the increase in precision for a selection of compiler-generated executables.
{"title":"A non-convex abstract domain for the value analysis of binaries","authors":"Sven Mattsen, Arne Wichmann, S. Schupp","doi":"10.1109/SANER.2015.7081837","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081837","url":null,"abstract":"A challenge in sound reverse engineering of binary executables is to determine sets of possible targets for dynamic jumps. One technique to address this challenge is abstract interpretation, where singleton values in registers and memory locations are overapproximated to collections of possible values. With contemporary abstract interpretation techniques, convexity is usually enforced on these collections, which causes unacceptable loss of precision. We present a non-convex abstract domain, suitable for the analysis of binary executables. The domain is based on binary decision diagrams (BDD) to allow an efficient representation of non-convex sets of integers. Non-convex sets are necessary to represent the results of jump table lookups and bitwise operations, which are more frequent in executables than in high-level code because of optimizing compilers. Our domain computes abstract bitwise and arithmetic operations precisely and looses precision only for division and multiplication. Because the operations are defined on the structure of the BDDs, they remain efficient even if executed on very large sets. In executables, conditional jumps require solving formulas built with negation and conjunction. We implement a constraint solver using the fast intersection and complementation of BDD-based sets. Our domain is implemented as a plug-in, called BDDStab, and integrated with the binary analysis framework Jakstab. We use Jakstab's k-set and interval domains to discuss the increase in precision for a selection of compiler-generated executables.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128004693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A framework for cost-effective dependence-based dynamic impact analysis
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081833
Haipeng Cai, Raúl A. Santelices
Dynamic impact analysis can greatly assist developers with managing software changes by focusing their attention on the effects of potential changes relative to concrete program executions. While dependence-based dynamic impact analysis (DDIA) provides finer-grained results than traceability-based approaches, traditional DDIA techniques often produce imprecise results and incur excessive costs, thus hindering their adoption in many practical situations. In this paper, we present the design and evaluation of a DDIA framework and three new instances of it that offer not only much more precise impact sets but also flexible cost-effectiveness options to meet diverse application needs, such as different budgets and levels of detail of results. By exploiting both static dependencies and various dynamic information, including method-execution traces, statement coverage, and dynamic points-to data, our techniques achieve that goal at reasonable costs according to our experimental results. Our study also suggests that statement coverage has generally stronger effects on the precision and cost-effectiveness of DDIA than dynamic points-to data.
{"title":"A framework for cost-effective dependence-based dynamic impact analysis","authors":"Haipeng Cai, Raúl A. Santelices","doi":"10.1109/SANER.2015.7081833","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081833","url":null,"abstract":"Dynamic impact analysis can greatly assist developers with managing software changes by focusing their attention on the effects of potential changes relative to concrete program executions. While dependence-based dynamic impact analysis (DDIA) provides finer-grained results than traceability-based approaches, traditional DDIA techniques often produce imprecise results, incurring excessive costs thus hindering their adoption in many practical situations. In this paper, we present the design and evaluation of a DDIA framework and its three new instances that offer not only much more precise impact sets but also flexible cost-effectiveness options to meet diverse application needs such as different budgets and levels of detail of results. By exploiting both static dependencies and various dynamic information including method-execution traces, statement coverage, and dynamic points-to data, our techniques achieve that goal at reasonable costs according to our experiment results. Our study also suggests that statement coverage has generally stronger effects on the precision and cost-effectiveness of DDIA than dynamic points-to data.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115957614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An observational study on API usage constraints and their documentation
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081813
M. Saied, H. Sahraoui, Bruno Dufour
Nowadays, APIs represent the most common form of reuse when developing software. However, the reuse benefits depend greatly on the ability of client application developers to use the APIs correctly. In this paper, we present an observational study on API usage constraints and their documentation. To conduct the study on a large number of APIs, we implemented and validated strategies to automatically detect four types of usage constraints in existing APIs. We observed that some of the constraint types are frequent and that three of the types are generally not documented. Surprisingly, the absence of documentation is, in general, specific to the constraints and not due to a general habit of developers not documenting.
{"title":"An observational study on API usage constraints and their documentation","authors":"M. Saied, H. Sahraoui, Bruno Dufour","doi":"10.1109/SANER.2015.7081813","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081813","url":null,"abstract":"Nowadays, APIs represent the most common reuse form when developing software. However, the reuse benefits depend greatly on the ability of client application developers to use correctly the APIs. In this paper, we present an observational study on the API usage constraints and their documentation. To conduct the study on a large number of APIs, we implemented and validated strategies to automatically detect four types of usage constraints in existing APIs. We observed that some of the constraint types are frequent and that for three types, they are not documented in general. Surprisingly, the absence of documentation is, in general, specific to the constraints and not due to the non documenting habits of developers.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130809500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A software quality model for RPG
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081819
Gergely Ladányi, Z. Tóth, R. Ferenc, Tibor Keresztesi
The IBM i mainframe was designed to manage business applications for which reliability and quality are a matter of national security. The RPG programming language is the most frequently used one on this platform. The maintainability of the source code has a big influence on development costs, which is probably why it is one of the most frequently observed and evaluated quality characteristics. To improve, or at least preserve, the maintainability of software, it is necessary to evaluate it regularly. In this study we present a quality model based on the ISO/IEC 25010 international standard for evaluating the maintainability of software systems written in RPG. To evaluate the quality model, we present a case study in which we explain how we integrated it as a continuous quality monitoring tool into the business processes of a mid-size software company with more than twenty years of experience in developing RPG applications.
{"title":"A software quality model for RPG","authors":"Gergely Ladányi, Z. Tóth, R. Ferenc, Tibor Keresztesi","doi":"10.1109/SANER.2015.7081819","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081819","url":null,"abstract":"The IBM i mainframe was designed to manage business applications for which the reliability and quality is a matter of national security. The RPG programming language is the most frequently used one on this platform. The maintainability of the source code has big influence on the development costs, probably this is the reason why it is one of the most attractive, observed and evaluated quality characteristic of all. For improving or at least preserving the maintainability level of software it is necessary to evaluate it regularly. In this study we present a quality model based on the ISO/IEC 25010 international standard for evaluating the maintainability of software systems written in RPG. As an evaluation step of the quality model we show a case study in which we explain how we integrated the quality model as a continuous quality monitoring tool into the business processes of a mid-size software company which has more than twenty years of experience in developing RPG applications.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131156420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Threshold-free code clone detection for a large-scale heterogeneous Java repository
Pub Date : 2015-03-02, DOI: 10.1109/SANER.2015.7081830
I. Keivanloo, Feng Zhang, Ying Zou
Code clones are unavoidable entities in software ecosystems. A variety of clone-detection algorithms are available for finding code clones. For Type-3 clone detection at method granularity (i.e., similar methods with changes in statements), the dissimilarity threshold is one of the possible configuration parameters. Existing approaches use a single threshold to detect Type-3 clones across a repository. However, our study shows that to detect Type-3 clones at method granularity on a large-scale heterogeneous repository, multiple thresholds are often required. We find that the performance of clone detection improves when different thresholds are selected for different groups of clones in a heterogeneous repository (i.e., different applications). In this paper, we propose a threshold-free approach to detect Type-3 clones at method granularity across a large number of applications. Our approach uses an unsupervised learning algorithm, k-means, to determine true and false clones. We use a clone benchmark with 330,840 tagged clones from 24,824 open source Java projects for our study. We observe that our approach improves performance significantly, by 12% in terms of F-measure. Furthermore, our threshold-free approach eliminates the concern of practitioners about possible misconfiguration of Type-3 clone detection tools.
{"title":"Threshold-free code clone detection for a large-scale heterogeneous Java repository","authors":"I. Keivanloo, Feng Zhang, Ying Zou","doi":"10.1109/SANER.2015.7081830","DOIUrl":"https://doi.org/10.1109/SANER.2015.7081830","url":null,"abstract":"Code clones are unavoidable entities in software ecosystems. A variety of clone-detection algorithms are available for finding code clones. For Type-3 clone detection at method granularity (i.e., similar methods with changes in statements), dissimilarity threshold is one of the possible configuration parameters. Existing approaches use a single threshold to detect Type-3 clones across a repository. However, our study shows that to detect Type-3 clones at method granularity on a large-scale heterogeneous repository, multiple thresholds are often required. We find that the performance of clone detection improves if selecting different thresholds for various groups of clones in a heterogeneous repository (i.e., various applications). In this paper, we propose a threshold-free approach to detect Type-3 clones at method granularity across a large number of applications. Our approach uses an unsupervised learning algorithm, i.e., k-means, to determine true and false clones. We use a clone benchmark with 330,840 tagged clones from 24,824 open source Java projects for our study. We observe that our approach improves the performance significantly by 12% in terms of F-measure. Furthermore, our threshold-free approach eliminates the concern of practitioners about possible misconfiguration of Type-3 clone detection tools.","PeriodicalId":355949,"journal":{"name":"2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129467204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}