Defect prediction aims to automatically identify potentially defective code with minimal human intervention and has been widely studied in the literature. Just-in-Time (JIT) defect prediction focuses on program changes rather than whole programs, and has been widely adopted in continuous testing. CC2Vec, a state-of-the-art JIT defect prediction tool, first constructs a hierarchical attention network (HAN) to learn distributed vector representations of both code additions and deletions, and then concatenates them with two other embedding vectors representing commit messages and overall code changes extracted by the existing DeepJIT approach to train a model for predicting whether a given commit is defective. Although CC2Vec has been shown to be the state of the art for JIT defect prediction, it was only evaluated on a limited dataset and not compared with all representative baselines. Therefore, to further investigate the efficacy and limitations of CC2Vec, this paper performs an extensive study of CC2Vec on a large-scale dataset with over 310,370 changes (8.3X larger than the original CC2Vec dataset). More specifically, we empirically compare CC2Vec against DeepJIT and representative traditional JIT defect prediction techniques. The experimental results show that CC2Vec cannot consistently outperform DeepJIT, and that neither of them can consistently outperform traditional JIT defect prediction techniques. We also investigate the impact of individual traditional defect prediction features and find that the added-line-number feature outperforms the other traditional features. Inspired by this finding, we construct a simplistic JIT defect prediction approach that simply uses the added-line-number feature with a logistic regression classifier. Surprisingly, such a simplistic approach can outperform CC2Vec and DeepJIT in defect prediction, and can be 81kX/120kX faster in training/testing. Furthermore, the paper provides various practical guidelines for advancing JIT defect prediction in the near future.
{"title":"Deep just-in-time defect prediction: how far are we?","authors":"Zhen Zeng, Yuqun Zhang, Haotian Zhang, Lingming Zhang","doi":"10.1145/3460319.3464819","DOIUrl":"https://doi.org/10.1145/3460319.3464819","url":null,"abstract":"Defect prediction aims to automatically identify potential defective code with minimal human intervention and has been widely studied in the literature. Just-in-Time (JIT) defect prediction focuses on program changes rather than whole programs, and has been widely adopted in continuous testing. CC2Vec, state-of-the-art JIT defect prediction tool, first constructs a hierarchical attention network (HAN) to learn distributed vector representations of both code additions and deletions, and then concatenates them with two other embedding vectors representing commit messages and overall code changes extracted by the existing DeepJIT approach to train a model for predicting whether a given commit is defective. Although CC2Vec has been shown to be the state of the art for JIT defect prediction, it was only evaluated on a limited dataset and not compared with all representative baselines. Therefore, to further investigate the efficacy and limitations of CC2Vec, this paper performs an extensive study of CC2Vec on a large-scale dataset with over 310,370 changes (8.3 X larger than the original CC2Vec dataset). More specifically, we also empirically compare CC2Vec against DeepJIT and representative traditional JIT defect prediction techniques. The experimental results show that CC2Vec cannot consistently outperform DeepJIT, and neither of them can consistently outperform traditional JIT defect prediction. We also investigate the impact of individual traditional defect prediction features and find that the added-line-number feature outperforms other traditional features. Inspired by this finding, we construct a simplistic JIT defect prediction approach which simply adopts the added-line-number feature with the logistic regression classifier. Surprisingly, such a simplistic approach can outperform CC2Vec and DeepJIT in defect prediction, and can be 81k X/120k X faster in training/testing. Furthermore, the paper also provides various practical guidelines for advancing JIT defect prediction in the near future.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123035773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the widespread use of cloud-native architecture, a growing number of web applications (apps) are built on microservices. At the same time, troubleshooting has become challenging owing to the high dynamics and complexity of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights from operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis of microservice kernel failures. To this end, we develop a crowdsourcing solution, DyCause, to resolve the asymmetric diagnostic information problem. DyCause is deployed on the application side in a distributed manner. Through lightweight API log sharing, apps collect the operational status of kernel services collaboratively and initiate diagnosis on demand. Deploying DyCause is fast and lightweight, as it imposes no architectural or functional requirements on the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that can efficiently discover the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause, running in user space, outperforms several state-of-the-art kernel-space algorithms in accuracy. In addition, DyCause shows clear advantages in algorithmic efficiency and data sensitivity: it produces significantly better results than other baselines when analyzing far fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.
{"title":"Faster, deeper, easier: crowdsourcing diagnosis of microservice kernel failure from user space","authors":"Yicheng Pan, Meng Ma, Xinrui Jiang, Ping Wang","doi":"10.1145/3460319.3464805","DOIUrl":"https://doi.org/10.1145/3460319.3464805","url":null,"abstract":"With the widespread use of cloud-native architecture, increasing web applications (apps) choose to build on microservices. Simultaneously, troubleshooting becomes full of challenges owing to the high dynamics and complexity of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights on operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis for microservice kernel failures. To this end, we develop a crowdsourcing solution - DyCause, to resolve the asymmetric diagnostic information problem. DyCause deploys on the application side in a distributed manner. Through lightweight API log sharing, apps collect the operational status of kernel services collaboratively and initiate diagnosis on demand. Deploying DyCause is fast and lightweight as we do not have any architectural and functional requirements for the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that can efficiently discover the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause running in the user-space outperforms several state-of-the-art algorithms running in the kernel on accuracy. Besides, DyCause shows superior advantages in terms of algorithmic efficiency and data sensitivity. Simply put, DyCause produces a significantly better result than other baselines when analyzing much fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124874793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrew Habib, Avraham Shinnar, Martin Hirzel, Michael Pradel
JSON is a data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. To ensure that JSON data is compatible with an application, one can define a JSON schema and use a validator to check data against the schema. However, because validation can happen only once concrete data occurs during an execution, it may detect data compatibility bugs too late or not at all. Examples include evolving the schema for a web API, which may unexpectedly break client applications, or accidentally running a machine learning pipeline on incorrect data. This paper presents a novel way of detecting a class of data compatibility bugs via JSON subschema checking. Subschema checks find bugs before concrete JSON data is available and across all possible data specified by a schema. For example, one can check if evolving a schema would break API clients or if two components of a machine learning pipeline have incompatible expectations about data. Deciding whether one JSON schema is a subschema of another is non-trivial because the JSON Schema specification language is rich. Our key insight to address this challenge is to first reduce the richness of schemas by canonicalizing and simplifying them, and to then reason about the subschema question on simpler schema fragments using type-specific checkers. We apply our subschema checker to thousands of real-world schemas from different domains. In all experiments, the approach is correct whenever it gives an answer (100% precision and correctness), which is the case for most schema pairs (93.5% recall), clearly outperforming the state-of-the-art tool. Moreover, the approach reveals 43 previously unknown bugs in popular software, most of which have already been fixed, showing that JSON subschema checking helps find data compatibility bugs early.
{"title":"Finding data compatibility bugs with JSON subschema checking","authors":"Andrew Habib, Avraham Shinnar, Martin Hirzel, Michael Pradel","doi":"10.1145/3460319.3464796","DOIUrl":"https://doi.org/10.1145/3460319.3464796","url":null,"abstract":"JSON is a data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. To ensure that JSON data is compatible with an application, one can define a JSON schema and use a validator to check data against the schema. However, because validation can happen only once concrete data occurs during an execution, it may detect data compatibility bugs too late or not at all. Examples include evolving the schema for a web API, which may unexpectedly break client applications, or accidentally running a machine learning pipeline on incorrect data. This paper presents a novel way of detecting a class of data compatibility bugs via JSON subschema checking. Subschema checks find bugs before concrete JSON data is available and across all possible data specified by a schema. For example, one can check if evolving a schema would break API clients or if two components of a machine learning pipeline have incompatible expectations about data. Deciding whether one JSON schema is a subschema of another is non-trivial because the JSON Schema specification language is rich. Our key insight to address this challenge is to first reduce the richness of schemas by canonicalizing and simplifying them, and to then reason about the subschema question on simpler schema fragments using type-specific checkers. We apply our subschema checker to thousands of real-world schemas from different domains. In all experiments, the approach is correct whenever it gives an answer (100% precision and correctness), which is the case for most schema pairs (93.5% recall), clearly outperforming the state-of-the-art tool. Moreover, the approach reveals 43 previously unknown bugs in popular software, most of which have already been fixed, showing that JSON subschema checking helps finding data compatibility bugs early.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122328306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrián Herrera, Hendra Gunadi, S. Magrath, Michael Norrish, Mathias Payer, Antony Lloyd Hosking
Mutation-based greybox fuzzing---unquestionably the most widely-used fuzzing technique---relies on a set of non-crashing seed inputs (a corpus) to bootstrap the bug-finding process. When evaluating a fuzzer, common approaches for constructing this corpus include: (i) using an empty file; (ii) using a single seed representative of the target's input format; or (iii) collecting a large number of seeds (e.g., by crawling the Internet). Little thought is given to how this seed choice affects the fuzzing process, and there is no consensus on which approach is best (or even if a best approach exists). To address this gap in knowledge, we systematically investigate and evaluate how seed selection affects a fuzzer's ability to find bugs in real-world software. This includes a systematic review of seed selection practices used in both evaluation and deployment contexts, and a large-scale empirical evaluation (over 33 CPU-years) of six seed selection approaches. These six seed selection approaches include three corpus minimization techniques (which select the smallest subset of seeds that trigger the same range of instrumentation data points as a full corpus). Our results demonstrate that fuzzing outcomes vary significantly depending on the initial seeds used to bootstrap the fuzzer, with minimized corpora outperforming singleton, empty, and large (in the order of thousands of files) seed sets. Consequently, we encourage seed selection to be foremost in mind when evaluating/deploying fuzzers, and recommend that (a) seed choice be carefully considered and explicitly documented, and (b) never to evaluate fuzzers with only a single seed.
{"title":"Seed selection for successful fuzzing","authors":"Adrián Herrera, Hendra Gunadi, S. Magrath, Michael Norrish, Mathias Payer, Antony Lloyd Hosking","doi":"10.1145/3460319.3464795","DOIUrl":"https://doi.org/10.1145/3460319.3464795","url":null,"abstract":"Mutation-based greybox fuzzing---unquestionably the most widely-used fuzzing technique---relies on a set of non-crashing seed inputs (a corpus) to bootstrap the bug-finding process. When evaluating a fuzzer, common approaches for constructing this corpus include: (i) using an empty file; (ii) using a single seed representative of the target's input format; or (iii) collecting a large number of seeds (e.g., by crawling the Internet). Little thought is given to how this seed choice affects the fuzzing process, and there is no consensus on which approach is best (or even if a best approach exists). To address this gap in knowledge, we systematically investigate and evaluate how seed selection affects a fuzzer's ability to find bugs in real-world software. This includes a systematic review of seed selection practices used in both evaluation and deployment contexts, and a large-scale empirical evaluation (over 33 CPU-years) of six seed selection approaches. These six seed selection approaches include three corpus minimization techniques (which select the smallest subset of seeds that trigger the same range of instrumentation data points as a full corpus). Our results demonstrate that fuzzing outcomes vary significantly depending on the initial seeds used to bootstrap the fuzzer, with minimized corpora outperforming singleton, empty, and large (in the order of thousands of files) seed sets. Consequently, we encourage seed selection to be foremost in mind when evaluating/deploying fuzzers, and recommend that (a) seed choice be carefully considered and explicitly documented, and (b) never to evaluate fuzzers with only a single seed.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116158567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ridwan Shariffdeen, Xiang Gao, Gregory J. Duck, Shin Hwei Tan, J. Lawall, Abhik Roychoudhury
Whenever a bug or vulnerability is detected in the Linux kernel, the kernel developers will endeavour to fix it by introducing a patch into the mainline version of the Linux kernel source tree. However, many users run older “stable” versions of Linux, meaning that the patch should also be “backported” to one or more of these older kernel versions. This process is error-prone and there is usually a long delay in publishing the backported patch. Based on an empirical study, we show that around 8% of all commits submitted to Linux mainline are backported to older versions, but often more than one month elapses before the backport is available. Hence, we propose a patch backporting technique that can automatically transfer patches from the mainline version of Linux into older stable versions. Our approach first synthesizes a partial transformation rule based on a Linux mainline patch. This rule can then be generalized by analysing the alignment between the mainline and target versions. The generalized rule is then applied to the target version to produce a backported patch. We have implemented our transformation technique in a tool called FixMorph and evaluated it on 350 Linux mainline patches. FixMorph correctly backports 75.1% of them. Compared to existing techniques, FixMorph improves both precision and recall in backporting patches. Apart from automating software maintenance tasks, patch backporting helps reduce the exposure to known security vulnerabilities in stable versions of the Linux kernel.
{"title":"Automated patch backporting in Linux (experience paper)","authors":"Ridwan Shariffdeen, Xiang Gao, Gregory J. Duck, Shin Hwei Tan, J. Lawall, Abhik Roychoudhury","doi":"10.1145/3460319.3464821","DOIUrl":"https://doi.org/10.1145/3460319.3464821","url":null,"abstract":"Whenever a bug or vulnerability is detected in the Linux kernel, the kernel developers will endeavour to fix it by introducing a patch into the mainline version of the Linux kernel source tree. However, many users run older “stable” versions of Linux, meaning that the patch should also be “backported” to one or more of these older kernel versions. This process is error-prone and there is usually along delay in publishing the backported patch. Based on an empirical study, we show that around 8% of all commits submitted to Linux mainline are backported to older versions,but often more than one month elapses before the backport is available. Hence, we propose a patch backporting technique that can automatically transfer patches from the mainline version of Linux into older stable versions. Our approach first synthesizes a partial transformation rule based on a Linux mainline patch. This rule can then be generalized by analysing the alignment between the mainline and target versions. The generalized rule is then applied to the target version to produce a backported patch. We have implemented our transformation technique in a tool called FixMorph and evaluated it on 350 Linux mainline patches. FixMorph correctly backports 75.1% of them. Compared to existing techniques, FixMorph improves both the precision and recall in backporting patches. Apart from automation of software maintenance tasks, patch backporting helps in reducing the exposure to known security vulnerabilities in stable versions of the Linux kernel.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117186833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous integration advocates running the test suite of a project frequently, e.g., for every code change committed to a shared repository. This process imposes a high computational cost and sometimes also a high human cost, e.g., when developers must wait for the test suite to pass before a change appears in the main branch of the shared repository. However, only 4% of all test suite invocations turn a previously passing test suite into a failing test suite. The question arises whether running the test suite for each code change is really necessary. This paper presents continuous test suite failure prediction, which reduces the cost of continuous integration by predicting whether a particular code change should trigger the test suite at all. The core of the approach is a machine learning model based on features of the code change, the test suite, and the development history. We also present a theoretical cost model that describes when continuous test suite failure prediction is worthwhile. Evaluating the idea with 15k test suite runs from 242 open-source projects shows that the approach is effective at predicting whether running the test suite is likely to reveal a test failure. Moreover, we find that our approach improves the AUC over baselines that use features proposed for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively. Overall, continuous test suite failure prediction can significantly reduce the cost of continuous integration.
{"title":"Continuous test suite failure prediction","authors":"Cong Pan, Michael Pradel","doi":"10.1145/3460319.3464840","DOIUrl":"https://doi.org/10.1145/3460319.3464840","url":null,"abstract":"Continuous integration advocates to run the test suite of a project frequently, e.g., for every code change committed to a shared repository. This process imposes a high computational cost and sometimes also a high human cost, e.g., when developers must wait for the test suite to pass before a change appears in the main branch of the shared repository. However, only 4% of all test suite invocations turn a previously passing test suite into a failing test suite. The question arises whether running the test suite for each code change is really necessary. This paper presents continuous test suite failure prediction, which reduces the cost of continuous integration by predicting whether a particular code change should trigger the test suite at all. The core of the approach is a machine learning model based on features of the code change, the test suite, and the development history. We also present a theoretical cost model that describes when continuous test suite failure prediction is worthwhile. Evaluating the idea with 15k test suite runs from 242 open-source projects shows that the approach is effective at predicting whether running the test suite is likely to reveal a test failure. Moreover, we find that our approach improves the AUC over baselines that use features proposed for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively. Overall, continuous test suite failure prediction can significantly reduce the cost of continuous integration.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129218135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng Ren, Zijing Yin, Fuchen Ma, Zhenyang Xu, Yu Jiang, Chengnian Sun, Huizhong Li, Yan Cai
Security of smart contracts has attracted increasing attention in recent years. Many researchers have devoted themselves to devising testing tools for vulnerability detection. Each published tool has demonstrated its effectiveness through a series of evaluations in its own experimental scenarios. However, inconsistent evaluation settings, such as different datasets or performance metrics, may result in biased conclusions. In this paper, based on an empirical evaluation of widely used smart contract testing tools, we propose a unified standard to eliminate bias in the assessment process. First, we collect 46,186 source-available smart contracts from four influential organizations. This comprehensive dataset is open to the public and covers different code characteristics, vulnerability patterns, and application scenarios. Then we propose a 4-step evaluation process and summarize the differences among related work at each of these steps. We use nine representative tools to carry out extensive experiments. The results demonstrate that different choices of experimental settings can significantly affect tool performance and lead to misleading or even opposite conclusions. Finally, we identify several general problems of existing testing tools and propose possible directions for further improvement.
{"title":"Empirical evaluation of smart contract testing: what is the best choice?","authors":"Meng Ren, Zijing Yin, Fuchen Ma, Zhenyang Xu, Yu Jiang, Chengnian Sun, Huizhong Li, Yan Cai","doi":"10.1145/3460319.3464837","DOIUrl":"https://doi.org/10.1145/3460319.3464837","url":null,"abstract":"Security of smart contracts has attracted increasing attention in recent years. Many researchers have devoted themselves to devising testing tools for vulnerability detection. Each published tool has demonstrated its effectiveness through a series of evaluations on their own experimental scenarios. However, the inconsistency of evaluation settings such as different data sets or performance metrics, may result in biased conclusion. In this paper, based on an empirical evaluation of widely used smart contract testing tools, we propose a unified standard to eliminate the bias in the assessment process. First, we collect 46,186 source-available smart contracts from four influential organizations. This comprehensive dataset is open to the public and involves different code characteristics, vulnerability patterns and application scenarios. Then we propose a 4-step evaluation process and summarize the difference among relevant work in these steps. We use nine representative tools to carry out extensive experiments. The results demonstrate that different choices of experimental settings could significantly affect tool performance and lead to misleading or even opposite conclusions. Finally, we generalize some problems of existing testing tools, and propose some possible directions for further improvement.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131343041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Echidna is a widely used fuzzer for Ethereum Virtual Machine (EVM) compatible blockchain smart contracts that generates transaction sequences of calls to smart contracts. While Echidna is an essentially single-threaded tool, it is possible for multiple Echidna processes to communicate by use of a shared transaction sequence corpus. Echidna provides a very large variety of configuration options, since each smart contract may be best tested by a non-default configuration, and different faults or coverage targets within a single contract may also have differing ideal configurations. This paper presents echidna-parade, a tool that provides push-button multicore fuzzing, using Echidna as the underlying fuzzing engine, and automatically applies sophisticated diversification of configurations. Even without using multiple cores, echidna-parade can improve the effectiveness of fuzzing with Echidna, due to the advantages provided by multiple types of test configuration diversity. Using echidna-parade with multiple cores can produce significantly better results than Echidna, in less time.
{"title":"echidna-parade: a tool for diverse multicore smart contract fuzzing","authors":"Alex Groce, Gustavo Grieco","doi":"10.1145/3460319.3469076","DOIUrl":"https://doi.org/10.1145/3460319.3469076","url":null,"abstract":"Echidna is a widely used fuzzer for Ethereum Virtual Machine (EVM) compatible blockchain smart contracts that generates transaction sequences of calls to smart contracts. While Echidna is an essentially single-threaded tool, it is possible for multiple Echidna processes to communicate by use of a shared transaction sequence corpus. Echidna provides a very large variety of configuration options, since each smart contract may be best-tested by a non-default configuration, and different faults or coverage targets within a single contract may also have differing ideal configurations. This paper presents echidna-parade, a tool that provides pushbutton multicore fuzzing using Echidna as an underlying fuzzing engine, and automatically provides sophisticated diversification of configurations. Even without using multiple cores, echidna-parade can improve the effectiveness of fuzzing with Echidna, due to the advantages provided by multiple types of test configuration diversity. Using echidna-parade with multiple cores can produce significantly better results than Echidna, in less time.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115750682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Barslev Nielsen, Martin Toldam Torp, Anders Møller
Most of the code in typical Node.js applications comes from third-party libraries that consist of a large number of interdependent modules. Because of the dynamic features of JavaScript, it is difficult to obtain detailed information about the module dependencies, which is vital for reasoning about the potential consequences of security vulnerabilities in libraries, and for many other software development tasks. The underlying challenge is how to construct precise call graphs that capture the connectivity between functions in the modules. In this work we present a novel approach to call graph construction for Node.js applications that is modular, taking into account the modular structure of Node.js applications, and sufficiently accurate and efficient to be practically useful. We demonstrate experimentally that the constructed call graphs are useful for security scanning, reducing the number of false positives by 81% compared to npm audit and with zero false negatives. Compared to js-callgraph, the call graph construction is significantly more accurate and efficient. The experiments also show that the analysis time is reduced substantially when reusing modular call graphs.
{"title":"Modular call graph construction for security scanning of Node.js applications","authors":"Benjamin Barslev Nielsen, Martin Toldam Torp, Anders Møller","doi":"10.1145/3460319.3464836","DOIUrl":"https://doi.org/10.1145/3460319.3464836","url":null,"abstract":"Most of the code in typical Node.js applications comes from third-party libraries that consist of a large number of interdependent modules. Because of the dynamic features of JavaScript, it is difficult to obtain detailed information about the module dependencies, which is vital for reasoning about the potential consequences of security vulnerabilities in libraries, and for many other software development tasks. The underlying challenge is how to construct precise call graphs that capture the connectivity between functions in the modules. In this work we present a novel approach to call graph construction for Node.js applications that is modular, taking into account the modular structure of Node.js applications, and sufficiently accurate and efficient to be practically useful. We demonstrate experimentally that the constructed call graphs are useful for security scanning, reducing the number of false positives by 81% compared to npm audit and with zero false negatives. Compared to js-callgraph, the call graph construction is significantly more accurate and efficient. The experiments also show that the analysis time is reduced substantially when reusing modular call graphs.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122159767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Yuan, Yeseop Lee, Chao Zhang, Yun Li, Yan Cai, Bodong Zhao
A growing number of bugs have been reported by vulnerability discovery solutions. Among them, some bugs are hard to diagnose or reproduce, including data race bugs caused by thread interleavings. Few solutions are able to address this issue well, due to the huge space of interleavings to explore. Worse, in security analysis scenarios, analysts usually have no access to the source code of target programs and have trouble comprehending them. In this paper, we propose a general solution, RAProducer, to efficiently diagnose and reproduce data race bugs for both user-land binary programs and kernels without source code. The efficiency of RAProducer is achieved by analyzing the execution trace of a given PoC (proof-of-concept) sample to recognize race- and bug-related elements (including locks and shared variables), which greatly facilitates narrowing down the huge search space of data race spots and thread interleavings. We have implemented a prototype of RAProducer and evaluated it on 7 kernel and 10 user-land data race bugs. Evaluation results show that RAProducer is effective at reproducing all these bugs. More importantly, it enabled us to diagnose 2 additional real-world bugs that had been left unconfirmed for a long time. It is also efficient, as it reduces the candidate data race spots of each bug to a small set and greatly narrows down the thread interleavings. RAProducer is also more effective at reproducing real-world data race bugs than other state-of-the-art solutions.
{"title":"RAProducer: efficiently diagnose and reproduce data race bugs for binaries via trace analysis","authors":"Ming Yuan, Yeseop Lee, Chao Zhang, Yun Li, Yan Cai, Bodong Zhao","doi":"10.1145/3460319.3464831","DOIUrl":"https://doi.org/10.1145/3460319.3464831","url":null,"abstract":"A growing number of bugs have been reported by vulnerability discovery solutions. Among them, some bugs are hard to diagnose or reproduce, including data race bugs caused by thread interleavings. Few solutions are able to well address this issue, due to the huge space of interleavings to explore. What’s worse, in security analysis scenarios, analysts usually have no access to the source code of target programs and have troubles in comprehending them. In this paper, we propose a general solution RAProducer to efficiently diagnose and reproduce data race bugs, for both user-land binary programs and kernels without source code. The efficiency of RAProducer is achieved by analyzing the execution trace of the given PoC (proof-of-concept) sample to recognize race- and bug-related elements (including locks and shared variables), which greatly facilitate narrowing down the huge search space of data race spots and thread interleavings. We have implemented a prototype of RAProducer and evaluated it on 7 kernel and 10 user-land data race bugs. Evaluation results showed that, RAProducer is effective at reproducing all these bugs. More importantly, it enables us to diagnose 2 extra real world bugs which are left unconfirmed for a long time. It is also efficient as it reduces candidate data race spots of each bug to a small set, and narrows down the thread interleaving greatly.RAProducer is also more effective in reproducing real-world data race bugs than other state-of-the-art solutions.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132462039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}