Language-Independent Clone Detection Applied to Plagiarism Detection
Romain Brixtel, Mathieu Fontaine, Boris Lesner, Cyril Bazin, R. Robbes
DOI: 10.1109/SCAM.2010.19

Clone detection is usually applied to find small- to medium-scale fragments of duplicated code in large software systems. In this paper, we address clone detection applied to plagiarism detection in source code assignments written by computer science students. Plagiarism detection comes with a set of constraints distinct from those of usual clone detection approaches, and these constraints influenced the design of the approach presented in this paper. For instance, the source code can be heavily changed at a superficial level (in an attempt to look genuine) yet remain functionally very similar. Since assignments turned in by computer science students can be written in a variety of languages, we work at the syntactic level and do not consider source-code semantics. Consequently, the approach we propose is endogenous and makes no assumption about the programming language being analysed. It is based on an alignment method that uses the parallel principle at a local resolution (the character level) to compute similarities between documents. We tested our framework on hundreds of real source files covering a wide array of programming languages (Java, C, Python, PHP, Haskell, bash). Our approach allowed us to discover previously undetected frauds and to empirically evaluate its accuracy and robustness.
{"title":"Language-Independent Clone Detection Applied to Plagiarism Detection","authors":"Romain Brixtel, Mathieu Fontaine, Boris Lesner, Cyril Bazin, R. Robbes","doi":"10.1109/SCAM.2010.19","DOIUrl":"https://doi.org/10.1109/SCAM.2010.19","url":null,"abstract":"Clone detection is usually applied in the context of detecting small-to medium scale fragments of duplicated code in large software systems. In this paper, we address the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students. Plagiarism detection comes with a distinct set of constraints to usual clone detection approaches, which influenced the design of the approach we present in this paper. For instance, the source code can be heavily changed at a superficial level (in an attempt to look genuine), yet be functionally very similar. Since assignments turned in by computer science students can be in a variety of languages, we work at the syntactic level and do not consider the source-code semantics. Consequently, the approach we propose is endogenous and makes no assumption about the programming language being analysed. It is based on an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents. We tested our framework on hundreds of real source files, involving a wide array of programming languages (Java, C, Python, PHP, Haskell, bash). Our approach allowed us to discover previously undetected frauds, and to empirically evaluate its accuracy and robustness.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122813749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

How Good is Static Analysis at Finding Concurrency Bugs?
Devin Kester, Martin Mwebesa, J. S. Bradbury
DOI: 10.1109/SCAM.2010.26

Detecting bugs in concurrent software is challenging because of the large number of possible thread interleavings. Dynamic analysis and testing approaches to bug detection are often costly, as they need to cover the interleaving space in addition to providing traditional black-box or white-box coverage. An alternative to dynamic detection of concurrency bugs is static analysis. This paper examines three static analysis tools (FindBugs, JLint and Chord) in order to assess each tool's ability to find concurrency bugs and to identify the percentage of spurious results produced. The empirical data presented is based on an experiment involving 12 concurrent Java programs.
{"title":"How Good is Static Analysis at Finding Concurrency Bugs?","authors":"Devin Kester, Martin Mwebesa, J. S. Bradbury","doi":"10.1109/SCAM.2010.26","DOIUrl":"https://doi.org/10.1109/SCAM.2010.26","url":null,"abstract":"Detecting bugs in concurrent software is challenging due to the many different thread interleavings. Dynamic analysis and testing solutions to bug detection are often costly as they need to provide coverage of the interleaving space in addition to traditional black box or white box coverage. An alternative to dynamic analysis detection of concurrency bugs is the use of static analysis. This paper examines the use of three static analysis tools (Find Bugs, J Lint and Chord) in order to assess each tool's ability to find concurrency bugs and to identify the percentage of spurious results produced. The empirical data presented is based on an experiment involving 12 concurrent Java programs.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126351994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

MemSafe: Ensuring the Spatial and Temporal Memory Safety of C at Runtime
Matthew S. Simpson, R. Barua
DOI: 10.1002/spe.2105

Memory access violations are a leading source of unreliability in C programs. As evidence of this problem, a variety of methods exist that retrofit C with software checks to detect memory errors at runtime. However, these methods generally suffer from one or more drawbacks, including the inability to detect all errors, the use of incompatible metadata, the need for manual code modifications, and high runtime overheads. In this paper, we present a compiler analysis and transformation for ensuring the memory safety of C called MemSafe. MemSafe makes several novel contributions that improve upon previous work and lower the cost of safety. These include (1) a method for modeling temporal errors as spatial errors, (2) a metadata representation that combines features of both object- and pointer-based approaches, and (3) a dataflow representation that simplifies optimizations for removing unneeded checks. MemSafe is capable of detecting real errors with lower overheads than previous efforts. Experimental results show that MemSafe detects all memory errors in 6 programs with known violations and ensures complete safety with an average overhead of 87% on 30 large programs widely used in evaluating error detection tools.
{"title":"MemSafe: Ensuring the Spatial and Temporal Memory Safety of C at Runtime","authors":"Matthew S. Simpson, R. Barua","doi":"10.1002/spe.2105","DOIUrl":"https://doi.org/10.1002/spe.2105","url":null,"abstract":"Memory access violations are a leading source of unreliability in C programs. As evidence of this problem, a variety of methods exist that retrofit C with software checks to detect memory errors at runtime. However, these methods generally suffer from one or more drawbacks including the inability to detect all errors, the use of incompatible metadata, the need for manual code modifications, and high runtime overheads. In this paper, we present a compiler analysis and transformation for ensuring the memory safety of C called MemSafe. MemSafe makes several novel contributions that improve upon previous work and lower the cost of safety. These include (1) a method for modeling temporal errors as spatial errors, (2) a metadata representation that combines features of both object - and pointer-based approaches, and (3) a dataflow representation that simplifies optimizations for removing unneeded checks. MemSafe is capable of detecting real errors with lower overheads than previous efforts. Experimental results show that MemSafe detects all memory errors in 6 programs with known violations and ensures complete safety with an average overhead of 87% on 30 large programs widely-used in evaluating error detection tools.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115396969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Parallel Reachability and Escape Analyses
Marcus Edvinsson, Jonas Lundberg, Welf Löwe
DOI: 10.1109/SCAM.2010.10

Static program analysis usually consists of a number of steps, each producing partial results. For example, the points-to analysis step, which calculates the objects that references in a program may point to, usually just provides the input for larger client analyses such as reachability and escape analyses. All of these analyses are computationally intensive, and it is therefore vital to create parallel approaches that exploit the processing power of the multiple cores in modern desktop computers. This paper presents two parallel approaches that increase the efficiency of reachability analysis and escape analysis, both built on a parallel points-to analysis. The experiments show that the two parallel approaches achieve a speed-up of 1.5 for reachability analysis and 3.8 for escape analysis on 8 cores for a benchmark suite of Java programs.
{"title":"Parallel Reachability and Escape Analyses","authors":"Marcus Edvinsson, Jonas Lundberg, Welf Löwe","doi":"10.1109/SCAM.2010.10","DOIUrl":"https://doi.org/10.1109/SCAM.2010.10","url":null,"abstract":"Static program analysis usually consists of a number of steps, each producing partial results. For example, the points-to analysis step, calculating object references in a program, usually just provides the input for larger client analyses like reach ability and escape analyses. All these analyses are computationally intense and it is therefore vital to create parallel approaches that make use of the processing power that comes from multiple cores in modern desktop computers. The present paper presents two parallel approaches to increase the efficiency of reach ability analysis and escape analysis, based on a parallel points-to analysis. The experiments show that the two parallel approaches achieve a speed-up of 1.5 for reach ability analysis and 3.8 for escape analysis on 8 cores for a benchmark suite of Java programs.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114380190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Deriving Coupling Metrics from Call Graphs
Simon Allier, S. Vaucher, Bruno Dufour, H. Sahraoui
DOI: 10.1109/SCAM.2010.25

Coupling metrics play an important role in empirical software engineering research as well as in industrial measurement programs. Existing coupling metrics have usually been defined so that they can be computed from a static analysis of the source code. However, modern programs make extensive use of dynamic language features such as polymorphism and dynamic class loading that are difficult to capture by static analysis. Consequently, the derived metric values might not accurately reflect the state of a program. In this paper, we express existing definitions of coupling metrics using call graphs. In an empirical study, we then compare the results obtained with four different call graph construction algorithms against standard tool implementations of these metrics. Our results show important variations in coupling between the standard and the call-graph-based calculations, owing to the handling of dynamic features.
{"title":"Deriving Coupling Metrics from Call Graphs","authors":"Simon Allier, S. Vaucher, Bruno Dufour, H. Sahraoui","doi":"10.1109/SCAM.2010.25","DOIUrl":"https://doi.org/10.1109/SCAM.2010.25","url":null,"abstract":"Coupling metrics play an important role in empirical software engineering research as well as in industrial measurement programs. The existing coupling metrics have usually been defined in a way that they can be computed from a static analysis of the source code. However, modern programs extensively use dynamic language features such as polymorphism and dynamic class loading that are difficult to capture by static analysis. Consequently, the derived metric values might not accurately reflect the state of a program. In this paper, we express existing definitions of coupling metrics using call graphs. We then compare the results of four different call graph construction algorithms with standard tool implementations of these metrics in an empirical study. Our results show important variations in coupling between standard and call graph-based calculations due to the support of dynamic features.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130457701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Recovering the Memory Behavior of Executable Programs
A. Ketterlin, P. Clauss
DOI: 10.1109/SCAM.2010.18

This paper deals with the binary analysis of executable programs, with the goal of understanding how they access memory. It explains how to statically build a formal model of all memory accesses. Starting from a control-flow graph of each procedure, well-known techniques are used to structure this graph into a hierarchy of loops in all cases. The paper shows that much more information can be extracted by performing a complete data-flow analysis over machine registers after the program has been put in static single assignment (SSA) form. Using the SSA form, registers used in addressing memory can be expressed symbolically in terms of other, previously set registers. By including the loop structures in the analysis, loop indices and trip counts can also often be expressed symbolically. The whole process produces a formal model made of loops in which memory accesses are linear expressions of loop counters and registers. The paper provides a quantitative evaluation of the results when applied to several dozen SPEC benchmark programs. Because static analysis is often incomplete, the paper ends by describing a lightweight instrumentation strategy that collects at run time enough information to complete the program's symbolic description.
{"title":"Recovering the Memory Behavior of Executable Programs","authors":"A. Ketterlin, P. Clauss","doi":"10.1109/SCAM.2010.18","DOIUrl":"https://doi.org/10.1109/SCAM.2010.18","url":null,"abstract":"This paper deals with the binary analysis of executable programs, with the goal of understanding how they access memory. It explains how to statically build a formal model of all memory accesses. Starting with a control-flow graph of each procedure, well-known techniques are used to structure this graph into a hierarchy of loops in all cases. The paper shows that much more information can be extracted by performing a complete data-flow analysis over machine registers after the program has been put in static single assignment (SSA) form. By using the SSA form, registers used in addressing memory can be symbolically expressed in terms of other, previously set registers. By including the loop structures in the analysis, loop indices and trip counts can also often be expressed symbolically. The whole process produces a formal model made of loops where memory accesses are linear expressions of loop counters and registers. The paper provides a quantitative evaluation of the results when applied to several dozens of SPEC benchmark programs. Because static analysis is often incomplete, the paper ends by describing a lightweight instrumentation strategy that collects at run time enough information to complete the program's symbolic description.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"18 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114045064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Reconstruction of Composite Types for Decompilation
K. Troshina, Yegor Derevenets, A. Chernov
DOI: 10.1109/SCAM.2010.24

Decompilation is the reconstruction of a program in a high-level language from a program in a low-level language. This paper presents a method for the automatic reconstruction of composite types (structures, arrays and combinations of them) in a high-level program during decompilation. Assembly code is obtained by disassembling binary code or from traces collected by a simulator. The proposed method is based on expressing memory access operations as (base, offset) pairs, then building equivalence classes for the bases used in the program and accumulating the offsets for each equivalence class. For strictly conforming C programs, our approach is substantiated by the C language semantics as defined in the international standard. However, experimental results show that it is also applicable to real-world programs. Experimental results are reported for a number of open-source programs as well as for traces collected from them. The method is an essential part of TyDec, a program decompilation tool being developed by the authors. TyDec can be used as a standalone tool or as a plug-in for the Interactive Trace Explorer TrEx, developed at the Institute for System Programming of the Russian Academy of Sciences.
{"title":"Reconstruction of Composite Types for Decompilation","authors":"K. Troshina, Yegor Derevenets, A. Chernov","doi":"10.1109/SCAM.2010.24","DOIUrl":"https://doi.org/10.1109/SCAM.2010.24","url":null,"abstract":"Decompilation is reconstruction of a program in a high-level language from a program in a low-level language. This paper presents a method for automatic reconstruction of composite types (structures, arrays and combinations of them)in a high-level program during decompilation. Assembly code is obtained by disassembling a binary code or traces collected by a simulator. The proposed method is based on expressing memory access operations as pairs base offset, then building equivalence classes for the bases used in the program and accumulating offsets for each equivalence class. For Strictly conforming C programs our approach is substantiated by the C language semantics as defined in the international standard. However, experimental results have revealed that it is applicable for real-world programs also. Experimental results are obtained for a number of open-source programs as well as for traces collected from them. The method is an essential part of the tool for program decompilation TyDec being developed by the authors. Decompiler TyDec can be used as a standalone tool or as a plug-in for Interactive Trace Explorer TrEx being developed in Institute for System Programming, Russian Academy of Sciences.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128421760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Subclass Instantiation Distribution
Amy Wheeler, D. Binkley
DOI: 10.1109/SCAM.2010.12

During execution, an object-oriented program typically creates a large number of objects. This research considers the distribution of those objects that share a common superclass. If this distribution is uniform, then all subclasses are equally likely to be instantiated. If it is not, the lack of uniformity can be exploited by giving preferential treatment to the dominant class (or classes). For example, a tester might spend greater testing resources on the dominant class, while an engineer refactoring the code might begin with a more dominant class. An experiment designed to investigate the distribution of subclass instantiations was performed on eight Java programs containing almost half a million lines of code and just over three thousand classes. The results show that, apart from a few infrequent exceptions, most distributions are heavily skewed.
{"title":"Subclass Instantiation Distribution","authors":"Amy Wheeler, D. Binkley","doi":"10.1109/SCAM.2010.12","DOIUrl":"https://doi.org/10.1109/SCAM.2010.12","url":null,"abstract":"During execution, an objected-oriented program typically creates a large number of objects. This research considers the distribution of those objects that share a common su per class. If this distribution is uniform then all subclasses are equally likely to be instantiated. However, if not, then the lack of uniformity can be exploited by giving preferential treatment to the dominant class (or classes). For example, a tester might spend greater testing resources on the dominant class while an engineer refactoring the code might begin with a more dominant class. An experiment designed to investigate the distribution of subclass instantiations was performed using eight Java programs containing almost half a million lines of code and just over three thousand classes. The results show that outside a few infrequent instances, most distributions are heavily skewed.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128446867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

AMBIDEXTER: Practical Ambiguity Detection
Bas Basten, T. Storm
DOI: 10.1109/SCAM.2010.21

Ambiguity detection tools try to statically track down ambiguities in context-free grammars. Current ambiguity detection tools, however, are either too slow for large realistic cases or produce incomprehensible ambiguity reports. AmbiDexter is an ambiguity detection tool that aims to let you have your cake and eat it too.
{"title":"AMBIDEXTER: Practical Ambiguity Detection","authors":"Bas Basten, T. Storm","doi":"10.1109/SCAM.2010.21","DOIUrl":"https://doi.org/10.1109/SCAM.2010.21","url":null,"abstract":"Ambiguity detection tools try to statically track down ambiguities in context-free grammars. Current ambiguity detection tools, however, either are too slow for large realistic cases, or produce incomprehensible ambiguity reports. AmbiDexter is the ambiguity tool to have your cake and eat it too.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114674118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Estimating the Optimal Number of Latent Concepts in Source Code Analysis
Scott Grant, J. Cordy
DOI: 10.1109/SCAM.2010.22

The optimal number of latent topics required to model the most accurate latent substructure of a source code corpus is an open question in source code analysis. Most estimates of the number of latent topics in a software corpus are based on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. To help determine the appropriate number of topics needed to accurately represent the source code, we generate a series of Latent Dirichlet Allocation models with varying topic counts. We use a heuristic to evaluate each model's ability to identify related source code blocks, and we demonstrate the consequences of choosing too few or too many latent topics.
{"title":"Estimating the Optimal Number of Latent Concepts in Source Code Analysis","authors":"Scott Grant, J. Cordy","doi":"10.1109/SCAM.2010.22","DOIUrl":"https://doi.org/10.1109/SCAM.2010.22","url":null,"abstract":"The optimal number of latent topics required to model the most accurate latent substructure for a source code corpus is an open question in source code analysis. Most estimates about the number of latent topics that exist in a software corpus are based on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. In order to help determine the appropriate number of topics needed to accurately represent the source code, we generate a series of Latent Dirichlet Allocation models with varying topic counts. We use a heuristic to evaluate the ability of the model to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.","PeriodicalId":222204,"journal":{"name":"2010 10th IEEE Working Conference on Source Code Analysis and Manipulation","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127173663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}