Han Gao, Shaoyin Cheng, Yinxing Xue, Weiming Zhang
Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the issue of reassigning descriptive names for functions to help facilitate reverse engineering. Since the essence of this issue is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies can only be based on small-scale datasets because their techniques suffer from heavyweight binary analysis, making them powerless in the face of big-size and large-scale binaries. This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both sequential and structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to help improve its usability. Benefiting from the lightweight design, NFRE can be efficiently trained on the large-scale dataset, thereby having better generalization capability for unknown functions. The comparative experiments show that NFRE outperforms two existing techniques by a relative improvement of 32% and 16%, respectively, while the time cost for binary analysis is much less.
{"title":"A lightweight framework for function name reassignment based on large-scale stripped binaries","authors":"Han Gao, Shaoyin Cheng, Yinxing Xue, Weiming Zhang","doi":"10.1145/3460319.3464804","DOIUrl":"https://doi.org/10.1145/3460319.3464804","url":null,"abstract":"Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the issue of reassigning descriptive names for functions to help facilitate reverse engineering. Since the essence of this issue is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies can only be based on small-scale datasets because their techniques suffer from heavyweight binary analysis, making them powerless in the face of big-size and large-scale binaries. This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both sequential and structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to help improve its usability. Benefiting from the lightweight design, NFRE can be efficiently trained on the large-scale dataset, thereby having better generalization capability for unknown functions. The comparative experiments show that NFRE outperforms two existing techniques by a relative improvement of 32% and 16%, respectively, while the time cost for binary analysis is much less.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125530587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Elsner, Florian Hauer, A. Pretschner, Silke Reimer
Regression test selection (RTS) and prioritization (RTP) techniques aim to reduce testing efforts and developer feedback time after a change to the code base. Using various information sources, including test traces, build dependencies, version control data, and test histories, they have been shown to be effective. However, not all of these sources are guaranteed to be available and accessible for arbitrary continuous integration (CI) environments. In contrast, metadata from version control systems (VCSs) and CI systems are readily available and inexpensive. Yet, corresponding RTP and RTS techniques are scattered across research and often only evaluated on synthetic faults or in a specific industrial context. It is cumbersome for practitioners to identify insights that apply to their context, let alone to calibrate associated parameters for maximum cost-effectiveness. This paper consolidates existing work on RTP and unsafe RTS into an actionable methodology to build and evaluate such approaches that exclusively rely on CI and VCS metadata. To investigate how these approaches from prior research compare in heterogeneous settings, we apply the methodology in a large-scale empirical study on a set of 23 projects covering 37,000 CI logs and 76,000 VCS commits. We find that these approaches significantly outperform established RTP baselines and, while still triggering 90% of the failures, we show that practitioners can expect to save on average 84% of test execution time for unsafe RTS. We also find that it can be beneficial to limit training data, features from test history work better than change-based features, and, somewhat surprisingly, simple and well-known heuristics often outperform complex machine-learned models.
{"title":"Empirically evaluating readily available information for regression test optimization in continuous integration","authors":"Daniel Elsner, Florian Hauer, A. Pretschner, Silke Reimer","doi":"10.1145/3460319.3464834","DOIUrl":"https://doi.org/10.1145/3460319.3464834","url":null,"abstract":"Regression test selection (RTS) and prioritization (RTP) techniques aim to reduce testing efforts and developer feedback time after a change to the code base. Using various information sources, including test traces, build dependencies, version control data, and test histories, they have been shown to be effective. However, not all of these sources are guaranteed to be available and accessible for arbitrary continuous integration (CI) environments. In contrast, metadata from version control systems (VCSs) and CI systems are readily available and inexpensive. Yet, corresponding RTP and RTS techniques are scattered across research and often only evaluated on synthetic faults or in a specific industrial context. It is cumbersome for practitioners to identify insights that apply to their context, let alone to calibrate associated parameters for maximum cost-effectiveness. This paper consolidates existing work on RTP and unsafe RTS into an actionable methodology to build and evaluate such approaches that exclusively rely on CI and VCS metadata. To investigate how these approaches from prior research compare in heterogeneous settings, we apply the methodology in a large-scale empirical study on a set of 23 projects covering 37,000 CI logs and 76,000 VCS commits. We find that these approaches significantly outperform established RTP baselines and, while still triggering 90% of the failures, we show that practitioners can expect to save on average 84% of test execution time for unsafe RTS. We also find that it can be beneficial to limit training data, features from test history work better than change-based features, and, somewhat surprisingly, simple and well-known heuristics often outperform complex machine-learned models.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"2014 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114718449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software reproducibility is important for re-usability and the cumulative progress of research. An important manifestation of unreproducible software is the changed outcome of software builds over time. While enhancing code reuse, the use of open-source dependency packages hosted on centralized repositories such as PyPI can have adverse effects on build reproducibility. Frequent updates to these packages often cause their latest versions to have breaking changes for applications using them. Large Python applications risk their historical builds becoming unreproducible due to the widespread usage of Python dependencies, and the lack of uniform practices for dependency version specification. Manually fixing dependency errors requires expensive developer time and effort, while automated approaches face challenges of parsing unstructured build logs, finding transitive dependencies, and exploring an exponential search space of dependency versions. In this paper, we investigate how open-source Python projects specify dependency versions, and how their reproducibility is impacted by dependency packages. We propose a tool PyDFix to detect and fix unreproducibility in Python builds caused by dependency errors. PyDFix is evaluated on two bug datasets BugSwarm and BugsInPy, both of which are built from real-world open-source projects. PyDFix analyzes a total of 2,702 builds, identifying 1,921 (71.1%) of them to be unreproducible due to dependency errors. From these, PyDFix provides a complete fix for 859 (44.7%) builds, and partial fixes for an additional 632 (32.9%) builds.
{"title":"Fixing dependency errors for Python build reproducibility","authors":"Suchita Mukherjee, Abigail Almanza, Cindy Rubio-González","doi":"10.1145/3460319.3464797","DOIUrl":"https://doi.org/10.1145/3460319.3464797","url":null,"abstract":"Software reproducibility is important for re-usability and the cumulative progress of research. An important manifestation of unreproducible software is the changed outcome of software builds over time. While enhancing code reuse, the use of open-source dependency packages hosted on centralized repositories such as PyPI can have adverse effects on build reproducibility. Frequent updates to these packages often cause their latest versions to have breaking changes for applications using them. Large Python applications risk their historical builds becoming unreproducible due to the widespread usage of Python dependencies, and the lack of uniform practices for dependency version specification. Manually fixing dependency errors requires expensive developer time and effort, while automated approaches face challenges of parsing unstructured build logs, finding transitive dependencies, and exploring an exponential search space of dependency versions. In this paper, we investigate how open-source Python projects specify dependency versions, and how their reproducibility is impacted by dependency packages. We propose a tool PyDFix to detect and fix unreproducibility in Python builds caused by dependency errors. PyDFix is evaluated on two bug datasets BugSwarm and BugsInPy, both of which are built from real-world open-source projects. PyDFix analyzes a total of 2,702 builds, identifying 1,921 (71.1%) of them to be unreproducible due to dependency errors. From these, PyDFix provides a complete fix for 859 (44.7%) builds, and partial fixes for an additional 632 (32.9%) builds.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123620524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing popularity of block-chain technologies, more and more engineers use smart contracts for application implementation. Traditional supporting tools can either provide code completions based on static libraries or detect a limited set of vulnerabilities, which results in the manpower waste during coding and miss-detection of bugs. In this work, we propose SCStudio, a unified smart contract development platform, which aims to help developers implement more secure smart contracts easily. The core idea is to realize real-time security-reinforced recommendation through pattern-based learning; and to perform security-oriented validation via integrated testing. SCStudio was implemented as a plug-in of VS Code. It has been used as the official development tool of WeBank and integrated as the recommended development tool by FISCO-BCOS community. In practice, it outperforms existing contract development environments, such as Remix, improving the average word suggestion accuracy by 30%-60% and helping detect about 25% more vulnerabilities. The video is presented at https://youtu.be/l6hW3Ds5Tkg.
{"title":"SCStudio: a secure and efficient integrated development environment for smart contracts","authors":"Meng Ren, Fuchen Ma, Zijing Yin, Huizhong Li, Ying Fu, Ting Chen, Yu Jiang","doi":"10.1145/3460319.3469078","DOIUrl":"https://doi.org/10.1145/3460319.3469078","url":null,"abstract":"With the increasing popularity of block-chain technologies, more and more engineers use smart contracts for application implementation. Traditional supporting tools can either provide code completions based on static libraries or detect a limited set of vulnerabilities, which results in the manpower waste during coding and miss-detection of bugs. In this work, we propose SCStudio, a unified smart contract development platform, which aims to help developers implement more secure smart contracts easily. The core idea is to realize real-time security-reinforced recommendation through pattern-based learning; and to perform security-oriented validation via integrated testing. SCStudio was implemented as a plug-in of VS Code. It has been used as the official development tool of WeBank and integrated as the recommended development tool by FISCO-BCOS community. In practice, it outperforms existing contract development environments, such as Remix, improving the average word suggestion accuracy by 30%-60% and helping detect about 25% more vulnerabilities. The video is presented at https://youtu.be/l6hW3Ds5Tkg.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129450729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kulani Mahadewa, Yanjun Zhang, Guangdong Bai, Lei Bu, Zhiqiang Zuo, Dileepa Fernando, Zhenkai Liang, J. Dong
With many trigger-action platforms that integrate Internet of Things (IoT) systems and online services, rich functionalities transparently connecting digital and physical worlds become easily accessible for the end users. On the other hand, such facilities incorporate multiple parties whose data control policies may radically differ and even contradict each other, and thus privacy violations may arise throughout the lifecycle (e.g., generation and transmission) of triggers and actions. In this work, we conduct an in-depth study on the privacy issues in multi-party trigger-action integration platforms (TAIPs). We first characterize privacy violations that may arise with the integration of heterogeneous systems and services. Based on this knowledge, we propose Taifu, a dynamic testing approach to identify privacy weaknesses from the TAIP. The key insight of Taifu is that the applets which actually program the trigger-action rules can be used as test cases to explore the behavior of the TAIP. We evaluate the effectiveness of our approach by applying it on the TAIPs that are built around the IFTTT platform. To our great surprise, we find that privacy violations are prevalent among them. Using the automatically generated 407 applets, each from a different TAIP, Taifu detects 194 cases with access policy breaches, 218 access control missing, 90 access revocation missing, 15 unintended flows, and 73 over-privilege access.
{"title":"Identifying privacy weaknesses from multi-party trigger-action integration platforms","authors":"Kulani Mahadewa, Yanjun Zhang, Guangdong Bai, Lei Bu, Zhiqiang Zuo, Dileepa Fernando, Zhenkai Liang, J. Dong","doi":"10.1145/3460319.3464838","DOIUrl":"https://doi.org/10.1145/3460319.3464838","url":null,"abstract":"With many trigger-action platforms that integrate Internet of Things (IoT) systems and online services, rich functionalities transparently connecting digital and physical worlds become easily accessible for the end users. On the other hand, such facilities incorporate multiple parties whose data control policies may radically differ and even contradict each other, and thus privacy violations may arise throughout the lifecycle (e.g., generation and transmission) of triggers and actions. In this work, we conduct an in-depth study on the privacy issues in multi-party trigger-action integration platforms (TAIPs). We first characterize privacy violations that may arise with the integration of heterogeneous systems and services. Based on this knowledge, we propose Taifu, a dynamic testing approach to identify privacy weaknesses from the TAIP. The key insight of Taifu is that the applets which actually program the trigger-action rules can be used as test cases to explore the behavior of the TAIP. We evaluate the effectiveness of our approach by applying it on the TAIPs that are built around the IFTTT platform. To our great surprise, we find that privacy violations are prevalent among them. Using the automatically generated 407 applets, each from a different TAIP, Taifu detects 194 cases with access policy breaches, 218 access control missing, 90 access revocation missing, 15 unintended flows, and 73 over-privilege access.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124567206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runxiang Cheng, Lingming Zhang, D. Marinov, Tianyin Xu
Configuration changes are among the dominant causes of failures of large-scale software system deployment. Given the velocity of configuration changes, typically at the scale of hundreds to thousands of times daily in modern cloud systems, checking these configuration changes is critical to prevent failures due to misconfigurations. Recent work has proposed configuration testing, Ctest, a technique that tests configuration changes together with the code that uses the changed configurations. Ctest can automatically generate a large number of ctests that can effectively detect misconfigurations, including those that are hard to detect by traditional techniques. However, running ctests can take a long time to detect misconfigurations. Inspired by traditional test-case prioritization (TCP) that aims to reorder test executions to speed up detection of regression code faults, we propose to apply TCP to reorder ctests to speed up detection of misconfigurations. We extensively evaluate a total of 84 traditional and novel ctest-specific TCP techniques. The experimental results on five widely used cloud projects demonstrate that TCP can substantially speed up misconfiguration detection. Our study provides guidelines for applying TCP to configuration testing in practice.
{"title":"Test-case prioritization for configuration testing","authors":"Runxiang Cheng, Lingming Zhang, D. Marinov, Tianyin Xu","doi":"10.1145/3460319.3464810","DOIUrl":"https://doi.org/10.1145/3460319.3464810","url":null,"abstract":"Configuration changes are among the dominant causes of failures of large-scale software system deployment. Given the velocity of configuration changes, typically at the scale of hundreds to thousands of times daily in modern cloud systems, checking these configuration changes is critical to prevent failures due to misconfigurations. Recent work has proposed configuration testing, Ctest, a technique that tests configuration changes together with the code that uses the changed configurations. Ctest can automatically generate a large number of ctests that can effectively detect misconfigurations, including those that are hard to detect by traditional techniques. However, running ctests can take a long time to detect misconfigurations. Inspired by traditional test-case prioritization (TCP) that aims to reorder test executions to speed up detection of regression code faults, we propose to apply TCP to reorder ctests to speed up detection of misconfigurations. We extensively evaluate a total of 84 traditional and novel ctest-specific TCP techniques. The experimental results on five widely used cloud projects demonstrate that TCP can substantially speed up misconfiguration detection. Our study provides guidelines for applying TCP to configuration testing in practice.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122645574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Mariani, Ali Mohebbi, M. Pezzè, Valerio Terragni
GUI testing is an important but expensive activity. Recently, research on test reuse approaches for Android applications produced interesting results. Test reuse approaches automatically migrate human-designed GUI tests from a source app to a target app that shares similar functionalities. They achieve this by exploiting semantic similarity among textual information of GUI widgets. Semantic matching of GUI events plays a crucial role in these approaches. In this paper, we present the first empirical study on semantic matching of GUI events. Our study involves 253 configurations of the semantic matching, 337 unique queries, and 8,099 distinct GUI events. We report several key findings that indicate how to improve semantic matching of test reuse approaches, propose SemFinder a novel semantic matching algorithm that outperforms existing solutions, and identify several interesting research directions.
{"title":"Semantic matching of GUI events for test reuse: are we there yet?","authors":"L. Mariani, Ali Mohebbi, M. Pezzè, Valerio Terragni","doi":"10.1145/3460319.3464827","DOIUrl":"https://doi.org/10.1145/3460319.3464827","url":null,"abstract":"GUI testing is an important but expensive activity. Recently, research on test reuse approaches for Android applications produced interesting results. Test reuse approaches automatically migrate human-designed GUI tests from a source app to a target app that shares similar functionalities. They achieve this by exploiting semantic similarity among textual information of GUI widgets. Semantic matching of GUI events plays a crucial role in these approaches. In this paper, we present the first empirical study on semantic matching of GUI events. Our study involves 253 configurations of the semantic matching, 337 unique queries, and 8,099 distinct GUI events. We report several key findings that indicate how to improve semantic matching of test reuse approaches, propose SemFinder a novel semantic matching algorithm that outperforms existing solutions, and identify several interesting research directions.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126042883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingjun Lyu, Sasha Volokh, William G. J. Halfond, Omer Tripp
Local databases underpin important features in many mobile applications, such as responsiveness in the face of poor connectivity. However, failure to use such databases correctly can lead to high resource consumption or even security vulnerabilities. We present SAND, an extensible static analysis approach that checks for misuse of local databases, also known as SQL antipatterns, in mobile apps. SAND features novel abstractions for common forms of application/database interactions, which enables concise and precise specification of the antipatterns that SAND checks for. To validate the efficacy of SAND, we have experimented with a diverse suite of 1,000 Android apps. We show that the abstractions that power SAND allow concise specification of all the known antipatterns from the literature (12-74 LOC), and that the antipatterns are modeled accurately (99.4-100% precision). As for performance, SAND requires on average 41 seconds to complete a scan on a mobile app.
{"title":"SAND: a static analysis approach for detecting SQL antipatterns","authors":"Yingjun Lyu, Sasha Volokh, William G. J. Halfond, Omer Tripp","doi":"10.1145/3460319.3464818","DOIUrl":"https://doi.org/10.1145/3460319.3464818","url":null,"abstract":"Local databases underpin important features in many mobile applications, such as responsiveness in the face of poor connectivity. However, failure to use such databases correctly can lead to high resource consumption or even security vulnerabilities. We present SAND, an extensible static analysis approach that checks for misuse of local databases, also known as SQL antipatterns, in mobile apps. SAND features novel abstractions for common forms of application/database interactions, which enables concise and precise specification of the antipatterns that SAND checks for. To validate the efficacy of SAND, we have experimented with a diverse suite of 1,000 Android apps. We show that the abstractions that power SAND allow concise specification of all the known antipatterns from the literature (12-74 LOC), and that the antipatterns are modeled accurately (99.4-100% precision). As for performance, SAND requires on average 41 seconds to complete a scan on a mobile app.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121908006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saikat Dutta, Jeeva Selvam, Aryaman Jain, Sasa Misailovic
The stochastic nature of many Machine Learning (ML) algorithms makes testing of ML tools and libraries challenging. ML algorithms allow a developer to control their accuracy and run-time through a set of hyper-parameters, which are typically manually selected in tests. This choice is often too conservative and leads to slow test executions, thereby increasing the cost of regression testing. We propose TERA, the first automated technique for reducing the cost of regression testing in Machine Learning tools and libraries(jointly referred to as projects) without making the tests more flaky. TERA solves the problem of exploring the trade-off space between execution time of the test and its flakiness as an instance of Stochastic Optimization over the space of algorithm hyper-parameters. TERA presents how to leverage statistical convergence-testing techniques to estimate the level of flakiness of the test for a specific choice of hyper-parameters during optimization. We evaluate TERA on a corpus of 160 tests selected from 15 popular machine learning projects. Overall, TERA obtains a geo-mean speedup of 2.23x over the original tests, for the minimum passing probability threshold of 99%. We also show that the new tests did not reduce fault detection ability through a mutation study and a study on a set of 12 historical build failures in studied projects.
{"title":"TERA: optimizing stochastic regression tests in machine learning projects","authors":"Saikat Dutta, Jeeva Selvam, Aryaman Jain, Sasa Misailovic","doi":"10.1145/3460319.3464844","DOIUrl":"https://doi.org/10.1145/3460319.3464844","url":null,"abstract":"The stochastic nature of many Machine Learning (ML) algorithms makes testing of ML tools and libraries challenging. ML algorithms allow a developer to control their accuracy and run-time through a set of hyper-parameters, which are typically manually selected in tests. This choice is often too conservative and leads to slow test executions, thereby increasing the cost of regression testing. We propose TERA, the first automated technique for reducing the cost of regression testing in Machine Learning tools and libraries(jointly referred to as projects) without making the tests more flaky. TERA solves the problem of exploring the trade-off space between execution time of the test and its flakiness as an instance of Stochastic Optimization over the space of algorithm hyper-parameters. TERA presents how to leverage statistical convergence-testing techniques to estimate the level of flakiness of the test for a specific choice of hyper-parameters during optimization. We evaluate TERA on a corpus of 160 tests selected from 15 popular machine learning projects. Overall, TERA obtains a geo-mean speedup of 2.23x over the original tests, for the minimum passing probability threshold of 99%. We also show that the new tests did not reduce fault detection ability through a mutation study and a study on a set of 12 historical build failures in studied projects.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114576839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziqi Shuai, Zhenbang Chen, Yufeng Zhang, Jun Sun, Ji Wang
Array constraints are prevalent in analyzing a program with symbolic execution. Solving array constraints is challenging due to the complexity of the precise encoding for arrays. In this work, we propose to synergize symbolic execution and array constraint solving. Our method addresses the difficulties in solving array constraints with novel ideas. First, we propose a lightweight method for pre-checking the unsatisfiability of array constraints based on integer linear programming. Second, observing that encoding arrays at the byte-level introduces many redundant axioms that reduce the effectiveness of constraint solving, we propose type and interval aware axiom generation. Note that the type information of array variables is inferred by symbolic execution, whereas interval information is calculated through the above pre-checking step. We have implemented our methods based on KLEE and its underlying constraint solver STP and conducted large-scale experiments on 75 real-world programs. The experimental results show that our method effectively improves the efficiency of symbolic execution. Our method solves 182.56% more constraints and explores 277.56% more paths on average under the same time threshold.
{"title":"Type and interval aware array constraint solving for symbolic execution","authors":"Ziqi Shuai, Zhenbang Chen, Yufeng Zhang, Jun Sun, Ji Wang","doi":"10.1145/3460319.3464826","DOIUrl":"https://doi.org/10.1145/3460319.3464826","url":null,"abstract":"Array constraints are prevalent in analyzing a program with symbolic execution. Solving array constraints is challenging due to the complexity of the precise encoding for arrays. In this work, we propose to synergize symbolic execution and array constraint solving. Our method addresses the difficulties in solving array constraints with novel ideas. First, we propose a lightweight method for pre-checking the unsatisfiability of array constraints based on integer linear programming. Second, observing that encoding arrays at the byte-level introduces many redundant axioms that reduce the effectiveness of constraint solving, we propose type and interval aware axiom generation. Note that the type information of array variables is inferred by symbolic execution, whereas interval information is calculated through the above pre-checking step. We have implemented our methods based on KLEE and its underlying constraint solver STP and conducted large-scale experiments on 75 real-world programs. The experimental results show that our method effectively improves the efficiency of symbolic execution. Our method solves 182.56% more constraints and explores 277.56% more paths on average under the same time threshold.","PeriodicalId":188008,"journal":{"name":"Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129398238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}