In a previous study, we showed that indentation is regular across multiple languages and that the variance in the indentation level of a block of revised code is correlated with metrics such as McCabe cyclomatic complexity. Building on that work, this paper investigates the relationship between the "shape" of the indentation of a revised code block (the "revision") and the corresponding syntactic structure of the code. We annotated revisions matching three indentation shapes: "flat" (all lines are equally indented), "slash" (indentation becomes increasingly deep), and "bubble" (indentation increases and then decreases). We then classified the code structure of each revision as one of: function definition, loop, expression, comment, etc. We studied thousands of revisions from over 200 software projects written in a variety of languages. Our study indicates that indentation shape correlates positively with code structure; that is, certain shapes typically correspond to certain code structures. For example, flat shapes commonly correspond to comments, while bubble shapes commonly correspond to conditionals and function definitions. These results can form the basis of a tool framework that analyzes code in a language-independent way to support browsing targeted at particular code structures such as conditionals or comments.
From Indentation Shapes to Code Structures. Abram Hindle, Michael W. Godfrey, R. Holt. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.31 (https://doi.org/10.1109/SCAM.2008.31)
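The three indentation shapes named in the abstract above can be illustrated with a small classifier. This is only a sketch under stated assumptions, not the authors' tooling: the function names `indent_levels` and `classify_shape`, the tab width of 8, the "other" fallback label, and the reading of "slash" as non-decreasing indentation are all choices made here for illustration.

```python
def indent_levels(lines, tab_width=8):
    """Return the indentation width (in columns) of each non-blank line."""
    levels = []
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no indentation signal
        expanded = line.expandtabs(tab_width)
        levels.append(len(expanded) - len(expanded.lstrip(" ")))
    return levels


def classify_shape(lines):
    """Label a block of lines as 'flat', 'slash', 'bubble', or 'other'."""
    levels = indent_levels(lines)
    if not levels:
        return "other"
    if len(set(levels)) == 1:
        return "flat"  # all lines equally indented
    diffs = [b - a for a, b in zip(levels, levels[1:])]
    if all(d >= 0 for d in diffs):
        return "slash"  # indentation only gets deeper
    # At least one decrease exists here; find where the first one happens.
    first_drop = next(i for i, d in enumerate(diffs) if d < 0)
    rises_first = any(d > 0 for d in diffs[:first_drop])
    only_falls_after = all(d <= 0 for d in diffs[first_drop:])
    if rises_first and only_falls_after:
        return "bubble"  # indentation increases, then decreases
    return "other"  # assumed fallback for shapes outside the three above


print(classify_shape(["// note one", "// note two"]))      # flat
print(classify_shape(["if (x) {", "    y();", "}"]))       # bubble
```

Because the classifier looks only at leading whitespace, it needs no parser for the language of the revision, which is the language-independent property the abstract relies on.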
Ensuring the correctness and reliability of software systems is one of the main problems in software development. Model checking, a static analysis method, excels at improving the precision of vulnerability detection. However, when it is applied to buffer overflows and other bugs, it is hard to automatically construct the model used to detect the vulnerabilities. To address this problem, we propose an approach that combines constraint-based analysis with model checking. Using constraint-based analysis, we trace the memory size of buffer-related variables and instrument the code with corresponding constraint assertions before the potentially vulnerable points. The problem of detecting vulnerabilities is then converted into the problem of verifying the reachability of these assertions by model checking. To reduce the cost of model checking, program slicing is introduced to reduce the code size. CodeAuditor is a prototype implementation of our approach. With CodeAuditor, several previously unreported vulnerabilities were discovered in several open-source software packages, and performance is shown to improve significantly with the help of program slicing.
Automated Detection of Code Vulnerabilities Based on Program Analysis and Model Checking. Lei Wang, Qiang Zhang, Peng Zhao. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.24 (https://doi.org/10.1109/SCAM.2008.24)
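The instrumentation step described in the abstract above, tracking buffer sizes and placing an assertion before each potentially vulnerable point, can be sketched on a toy program representation. Everything here is an assumption for illustration: the tuple-based statements (`alloc`, `copy`), the `assert_le` marker, and the function names are hypothetical and are not CodeAuditor's actual format. The second function simply evaluates the assertions on straight-line code and stands in for the model checker, which in the paper verifies reachability on real programs.

```python
def instrument(program):
    """Insert a size assertion before every copy into a tracked buffer.

    `program` is a list of tuples in a hypothetical toy format:
      ("alloc", name, size)   allocate `size` bytes for buffer `name`
      ("copy", dst, nbytes)   copy `nbytes` bytes into buffer `dst`
    """
    sizes = {}  # buffer name -> allocated size in bytes
    out = []
    for stmt in program:
        if stmt[0] == "alloc":
            _, name, size = stmt
            sizes[name] = size
        elif stmt[0] == "copy":
            _, dst, nbytes = stmt
            if dst in sizes:
                # Constraint assertion: bytes written must fit the buffer.
                out.append(("assert_le", nbytes, sizes[dst], dst))
        out.append(stmt)
    return out


def violated_assertions(instrumented):
    """Stand-in for the model checker: list assertions that do not hold."""
    return [s for s in instrumented if s[0] == "assert_le" and s[1] > s[2]]


program = [("alloc", "buf", 16), ("copy", "buf", 8), ("copy", "buf", 32)]
for stmt in violated_assertions(instrument(program)):
    print("possible overflow:", stmt)
```

Separating instrumentation from checking mirrors the split in the abstract: the assertions encode what must hold, and a separate verifier decides whether a violation can be reached.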
This paper argues that semantic information encoded in natural language identifiers is a largely neglected resource for program analysis. First we show that words in Java class names relate to class properties, expressed using the recently developed micro patterns language. We analyse a large corpus of Java programs to create a database that links common class name words with micro patterns. Finally we report on prototype tools integrated with the Eclipse development environment. These tools use the database to inform programmers of particular problems or optimization opportunities in their code.
Exploiting the Correspondence between Micro Patterns and Class Names. Jeremy Singer, C. Kirkham. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.23 (https://doi.org/10.1109/SCAM.2008.23)
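The database that links class name words with micro patterns, as described in the abstract above, can be pictured as a word-to-pattern co-occurrence table. The sketch below is an assumption-laden illustration rather than the authors' implementation: the camel-case splitting rule, the function names `split_words` and `build_word_index`, and the tiny corpus with its pattern labels are invented for the example.

```python
import re
from collections import Counter, defaultdict


def split_words(class_name):
    """Split a camel-case class name into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", class_name)
    return [p.lower() for p in parts]


def build_word_index(corpus):
    """Map each class name word to counts of co-occurring pattern labels.

    `corpus` is an iterable of (class_name, set_of_pattern_labels) pairs.
    """
    index = defaultdict(Counter)
    for class_name, patterns in corpus:
        for word in set(split_words(class_name)):
            index[word].update(patterns)
    return index


# Hypothetical, hand-made corpus; the labels are illustrative only.
corpus = [
    ("UserFactory", {"Stateless"}),
    ("WidgetFactory", {"Stateless", "Implementor"}),
    ("UserRecord", {"Immutable"}),
]
index = build_word_index(corpus)
print(index["factory"].most_common(1))  # [('Stateless', 2)]
```

A tool built on such a table could flag a class whose name words usually co-occur with a pattern that the class itself does not exhibit, which is the kind of hint the abstract mentions.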
The C preprocessor (CPP) is a major obstacle to accurately analyzing C source code, and such analysis is indispensable to refactoring tools for C programs. To analyze C source code accurately, we need to generate CPP mapping information between the unpreprocessed C source code and the preprocessed code. Previous work generates CPP mapping information by extending an existing CPP, which results in low portability and low maintainability because of the strong dependency on the CPP implementation. To solve this problem, this paper proposes a novel tracer-based approach (called TBCppA), which generates CPP mapping information by instrumenting the unpreprocessed C source code with XML-like tags called "tracers". The advantage of TBCppA is its high portability and high maintainability, which the previous methods lack. We successfully implemented a first prototype of TBCppA, and our preliminary evaluation of applying TBCppA to gcc-4.1.1 produced promising results.
TBCppA: A Tracer Approach for Automatic Accurate Analysis of C Preprocessor's Behaviors. K. Gondow, Hayato Kawashima, T. Imaizumi. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.13 (https://doi.org/10.1109/SCAM.2008.13)
This article describes some of the engineering approaches taken during the development of GrammaTech's static-analysis technology. These approaches have taken it from a prototype system with poor performance, poor scalability, and very limited applicability to a much more general-purpose, industrial-strength analysis infrastructure capable of operating on millions of lines of code. A wide variety of code bases are found in industry, and many extremes of usage exist, ranging from code size to the use of unusual or non-standard features and dialects. Some of the problems associated with handling these code bases are described, and the solutions used to address them, including some that were ultimately unsuccessful, are discussed.
90% Perspiration: Engineering Static Analysis Techniques for Industrial Applications. P. Anderson. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.11 (https://doi.org/10.1109/SCAM.2008.11)
Polymorphism and class hierarchies are key to increasing the extensibility of an object-oriented program but also raise challenges for program comprehension. Despite many advances in understanding and restructuring class hierarchies, there is no direct support to analyze and understand the design decisions that drive their polymorphic usage. In this paper we introduce a metric-based visual approach to capture the extent to which the clients of a hierarchy polymorphically manipulate that hierarchy. A visual pattern vocabulary is also presented in order to facilitate the communication between analysts. Initial evaluation shows that our techniques aid program comprehension by effectively visualizing large quantities of information, and can help detect several design problems.
Type Highlighting: A Client-Driven Visual Approach for Class Hierarchies Reengineering. Petru Florin Mihancea. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-10-03. DOI: 10.1109/SCAM.2008.16 (https://doi.org/10.1109/SCAM.2008.16)
Bug-checking tools have been used with some success in recent years to find bugs in software. To find bugs that can cause security vulnerabilities, bug-checking tools require a program analysis that determines whether a software bug can be controlled by user input. In this paper we introduce a static program analysis for computing user-input dependencies. This analysis can be used as a pre-processing filter for a static bug-checking tool to identify bugs that can potentially be exploited as security vulnerabilities. For the analysis to be applicable to large commercial software with millions of lines of code, the runtime speed and scalability of the user-input dependence analysis are of key importance. Our user-input dependence analysis takes both data and control dependencies into account. We extend static single assignment (SSA) form by augmenting phi-nodes with control dependencies. A formal definition of user-input dependence is expressed in a dataflow analysis framework as a meet-over-all-paths (MOP) solution. We reduce the equation system to a sparse equation system by exploiting the properties of SSA. The sparse equation system is solved as a reachability problem, which results in a fast algorithm for computing user-input dependencies. We have implemented a call-insensitive and a call-sensitive analysis. The paper gives preliminary results comparing their efficiency on various benchmarks.
User-Input Dependence Analysis via Graph Reachability. Bernhard Scholz, Chenyi Zhang, C. Cifuentes. 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, 2008-03-01. DOI: 10.1109/SCAM.2008.22 (https://doi.org/10.1109/SCAM.2008.22)
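The final step described in the abstract above, solving user-input dependence as a reachability problem, can be sketched as a breadth-first search over a dependence graph. This is a minimal illustration under assumptions, not the paper's algorithm: it does not build SSA form, augment phi-nodes, or derive the sparse equation system, and the graph encoding, node names, and function name `input_dependent` are hypothetical. An edge `a -> b` is taken to mean that `b` depends on `a` through either a data or a control dependence.

```python
from collections import deque


def input_dependent(edges, sources):
    """Return every node reachable from the user-input sources.

    `edges` maps a node to the nodes that depend on it (data or control).
    `sources` lists the nodes that read user input directly.
    """
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


# Hypothetical dependence graph: `argv` is user input, `const` is not.
edges = {
    "argv": ["n"],
    "n": ["len", "phi1"],   # phi1 also stands for a control-dependent merge
    "const": ["k"],
}
tainted = input_dependent(edges, ["argv"])
print(sorted(tainted))  # ['argv', 'len', 'n', 'phi1']
```

Each node and edge is visited at most once, so the search runs in time linear in the size of the graph, which is consistent with the abstract's emphasis on speed for large code bases.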