Internet-scale Real-time Code Clone Search Via Multi-level Indexing
I. Keivanloo, J. Rilling, P. Charland. DOI: 10.1109/WCRE.2011.13

Finding lines of code similar to a code fragment across large knowledge bases in fractions of a second is a new branch of code clone research, also known as real-time code clone search. Among the requirements real-time code clone search has to meet are scalability, short response time, scalable incremental corpus updates, and support for type-1, type-2, and type-3 clones. We conducted a set of empirical studies on a large open source code corpus to gain insight into its characteristics. We used these results to design and optimize a multi-level indexing approach that combines hash table-based lookup and binary search to improve Internet-scale real-time code clone search response time. Finally, we performed an evaluation on an Internet-scale corpus (1.5 million Java files and 266 MLOC). Our approach keeps the response time for 99.9% of clone searches in the microsecond range, while supporting the aforementioned requirements.
Stealthy Profiling and Debugging of Malware Trampolining from User to Kernel Space
J. Raber. DOI: 10.1109/WCRE.2011.62

A reverse engineer trying to understand protected malware binaries must avoid detection by anti-debugging protections. Advanced protection systems may even load specialized drivers that can re-flash firmware and change the privileges of running applications, significantly increasing the penalty of detection. Hades is a Windows kernel driver designed to aid reverse engineering endeavors. It avoids detection by employing intelligent instrumentation via instruction rerouting in both user and kernel space. This technique allows a reverse engineer to easily debug and profile binaries without fear of triggering protection penalties.
Reasoning over the Evolution of Source Code Using Quantified Regular Path Expressions
Andy Kellens, Coen De Roover, Carlos Noguera, Reinout Stevens, V. Jonckers. DOI: 10.1109/WCRE.2011.54

Version control systems (VCS) have become indispensable for developing software. Besides their immediate advantages, they also offer information about the evolution of software and its development process. Despite this wealth of information, it has so far been leveraged only by tools dedicated to a specific software engineering task, such as predicting bugs or identifying hotspots. General-purpose tool support for reasoning about the information contained in a version control system is limited. In this paper, we introduce the logic-based program query language ABSINTHE. It supports querying versioned software systems using logic queries in which quantified regular path expressions are embedded. These expressions lend themselves to specifying the properties that each individual version in a sequence of successive software versions ought to exhibit.
On the Effectiveness of Simhash for Detecting Near-Miss Clones in Large Scale Software Systems
M. Uddin, C. Roy, Kevin A. Schneider, Abram Hindle. DOI: 10.1109/WCRE.2011.12

Clone detection techniques essentially cluster textually, syntactically, and/or semantically similar code fragments within or across software systems. For large datasets, similarity identification is costly in both time and memory, especially when detecting near-miss clones, where lines may be modified, added, and/or deleted in the copied fragments. The capability and effectiveness of a clone detection tool depend largely on the code similarity measurement technique it uses. A variety of similarity measurement approaches have been used for clone detection, including fingerprint-based approaches, which have had varying degrees of success notwithstanding some limitations. In this paper, we investigate the effectiveness of simhash, a state-of-the-art fingerprint-based data similarity measurement technique, for detecting both exact and near-miss clones in large-scale software systems. Our experimental data show that simhash is indeed effective in identifying various types of clones in a software system despite wide variations in experimental circumstances. The approach is also suitable as a core capability for building other tools, such as tools for incremental clone detection, code searching, and clone management.
Reverse Engineering Feature Models from Programs' Feature Sets
Evelyn Nicole Haslinger, R. Lopez-Herrejon, Alexander Egyed. DOI: 10.1109/WCRE.2011.45

Successful software is less and less often developed as a one-of-a-kind system. Instead, different system variants are built from a common set of assets and customized to cater to the different functionality or technology needs of distinct clients and users. The Software Product Line Engineering (SPLE) paradigm has proven effective at coping with the variability described in this scenario. However, evolving a Software Product Line (SPL) from a family of systems is not a simple endeavor. A crucial requirement is accurately capturing the variability present in the family of systems and representing it with Feature Models (FMs), the de facto standard for variability modeling. Current research has focused on extracting FMs from configuration scripts, propositional logic expressions, or natural language. In contrast, in this short paper we present an algorithm that reverse engineers a basic feature model from the feature sets that describe the features each system provides. We evaluate our approach using several case studies and outline the issues that still need to be addressed.
Recommending People in Developers' Collaboration Network
Didi Surian, Nian Liu, D. Lo, Hanghang Tong, Ee-Peng Lim, C. Faloutsos. DOI: 10.1109/WCRE.2011.53

Much software development involves collaboration among developers across the globe. This is true for both open-source and closed-source development efforts. Developers collaborate on different projects of various types. As with any other teamwork endeavor, finding compatible members for a development team helps the team realize its goal. Compatible members tend to share a similar programming style and naming strategy, communicate well with one another, etc. However, finding the right person to work with is not an easy task. In this work, we extract information available from SourceForge.net, the largest database of open source software, and build a developer collaboration network comprising information on developers, projects, and project properties. Given an input developer, we then recommend a list of the top developers who are most compatible based on their programming language skills, past projects, and the project categories they have worked on before, via a random walk with restart procedure. Our quantitative and qualitative experiments show that we are able to recommend reasonable developer candidates from snapshots of SourceForge.net consisting of tens of thousands of developers and projects, and hundreds of project properties.
iProblems - An Integrated Instrument for Reporting Design Flaws, Vulnerabilities and Defects
Mihai Codoban, C. Marinescu, Radu Marinescu. DOI: 10.1109/WCRE.2011.65

In this demonstration, we present a new instrument that provides, for each class in an analyzed system, information about the problems the class exhibits. We associate different types of problems with a class: design flaws, vulnerabilities, and defects. To validate its usefulness, we performed experiments on a suite of object-oriented systems; some results are briefly presented in the last part of the demo.
Reconstructing Traceability between Bugs and Test Cases: An Experimental Study
Nilam Kaushik, L. Tahvildari, Mark Moore. DOI: 10.1109/WCRE.2011.58

In manual testing, testers typically follow the steps listed in the bug report to verify whether a bug has been fixed. Depending on time and the availability of resources, a tester may execute additional test cases to ensure test coverage. The process of finding the most relevant manual test cases to run is itself largely manual and relies on tester expertise. From a usability standpoint, the task is tedious, as the tester typically has to switch between the defect management tool and the test case management tool to search for test cases relevant to the bug at hand. In this paper, we use IR techniques to recover traceability between bugs and test cases with the aim of recommending test cases for bugs. We report on our experience recovering traceability between bugs and test cases using techniques such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) through a small industrial case study.
Towards the Extraction of Domain Concepts from the Identifiers
S. Abebe, P. Tonella. DOI: 10.1109/WCRE.2011.19

Program identifiers represent an invaluable source of information for developers who are not familiar with the code to be evolved. Domain concepts and inter-concept relationships can be automatically extracted by means of natural language processing techniques applied to program identifiers. However, the ontology produced by this approach tends to be very large and to include implementation details that reduce its usefulness for understanding domain concepts. In this paper, we analyze the effectiveness of information retrieval-based techniques for filtering domain concepts and relations from implementation details, so as to obtain a smaller, more informative domain ontology. In particular, we show that fully automated techniques based on keywords or topics perform quite poorly, while a semi-automated approach requiring limited user involvement can greatly improve the filtering of domain concepts.
Locating the Meaning of Terms in Source Code: Research on "Term Introduction"
Jan Nonnen, D. Speicher, Paul Imhoff. DOI: 10.1109/WCRE.2011.21

Software developers often face the challenge of understanding a large code base. Program comprehension is achieved not only by looking at object interactions, but also by considering the meaning of identifiers and the terms they contain. Ideally, the source code should exemplify this meaning. We propose to call the source code location that defines the meaning of a term its term introduction. With the help of an explorative study, we further derive a heuristic to determine the introduction location. This study was performed on 8,000 manually evaluated samples from 30 open source projects. To support reproducibility, all samples and classifications are also available online. The results show a precision of 75% for the heuristic.