Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks
Xiaohui Wan, Zheng Zheng, Fangyun Qin, Xuhui Lu
Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even fewer examine the difficulty of these tasks from the perspective of data complexity. In this paper, we conduct an empirical study that estimates the hardness of over 33,000 instances, employing a set of measures to characterize the inherent difficulty of instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes displays a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, and instance-level overlap; (3) no universal preprocessing technique is applicable to all datasets, and preprocessing may not consistently reduce data complexity; fortunately, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm's learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks and provides a novel perspective for advancing research in defect prediction techniques.
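The study's specific hardness measures are not reproduced here; as a flavor of what instance hardness estimation looks like, the sketch below computes k-Disagreeing Neighbors (kDN), a standard measure under which an instance is harder the more of its nearest neighbors carry the other label, on synthetic imbalanced data (the measure choice and all parameters are illustrative, not the paper's setup):

```python
# Minimal sketch of one common instance hardness measure, k-Disagreeing
# Neighbors (kDN): the fraction of an instance's k nearest neighbors whose
# label differs from its own. Synthetic data; not the paper's exact measures.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2],
                           random_state=0)  # imbalanced, like defect data

k = 5
# k+1 neighbors because each instance is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
kdn = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)  # disagreement ratio per instance

print("mean hardness (defective):", kdn[y == 1].mean())
print("mean hardness (clean):   ", kdn[y == 0].mean())
```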
{"title":"Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks","authors":"Xiaohui Wan, Zheng Zheng, Fangyun Qin, Xuhui Lu","doi":"10.1145/3649596","DOIUrl":"https://doi.org/10.1145/3649596","url":null,"abstract":"<p>Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even less on understanding the difficulties of these tasks from the perspective of data complexity. In this paper, we conduct an empirical study to estimate the hardness of over 33,000 instances, employing a set of measures to characterize the inherent difficulty of instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes displays a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, and instance-level overlap; (3) no universal preprocessing technique is applicable to all datasets, and it may not consistently reduce data complexity, fortunately, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm’s learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks, and provides a novel perspective for advancing research in defect prediction techniques.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"36 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139977851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risky Dynamic Typing Related Practices in Python: An Empirical Study
Zhifei Chen, Lin Chen, Yibiao Yang, Qiong Feng, Xuansong Li, Wei Song
Python’s dynamic typing nature provides developers with powerful programming abstractions. However, many type-related bugs accumulate in Python code bases due to the misuse of dynamic typing. The goal of this paper is to aid the understanding of developers’ high-risk practices around dynamic typing and the early detection of type-related bugs. We first formulate rules for six types of risky dynamic-typing-related practices (type smells for short) in Python. We then develop a rule-based tool named RUPOR, which builds an accurate type base to detect type smells. Our evaluation shows that RUPOR outperforms existing type smell detection techniques (including LLM-based approaches, Mypy, and PYDYPE) on a benchmark of 900 Python methods. Based on RUPOR, we conduct an empirical study on 25 real-world projects. We find that type smells are significantly related to the occurrence of post-release faults. The fault-proneness prediction model built with type smell features slightly outperforms the model built without them. We also summarize common fix patterns, including inserting type checks to fix type smell bugs. These findings provide valuable insights for preventing and fixing type-related bugs in programs written in dynamically typed languages.
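To make the notion of a type smell concrete, the sketch below shows a hypothetical instance of one risky practice (an inconsistent return type) together with the "insert type check" fix pattern mentioned above; it is illustrative only and not one of RUPOR's actual rules:

```python
# Illustrative only: a risky dynamic-typing practice of the kind such rules
# target (a function whose result may be int or None depending on the path),
# followed by the common "insert type check" fix. Not RUPOR's actual rules.

def parse_port(raw):
    # Smell: callers that do arithmetic on the result can fail at runtime
    # whenever the None branch is taken.
    if raw.isdigit():
        return int(raw)
    return None  # inconsistent return type

def next_port_fixed(raw):
    port = parse_port(raw)
    if not isinstance(port, int):        # inserted type check
        raise ValueError(f"invalid port: {raw!r}")
    return port + 1

print(next_port_fixed("8080"))           # 8081
```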
{"title":"Risky Dynamic Typing Related Practices in Python: An Empirical Study","authors":"Zhifei Chen, Lin Chen, Yibiao Yang, Qiong Feng, Xuansong Li, Wei Song","doi":"10.1145/3649593","DOIUrl":"https://doi.org/10.1145/3649593","url":null,"abstract":"<p>Python’s dynamic typing nature provides developers with powerful programming abstractions. However, many type related bugs are accumulated in code bases of Python due to the misuse of dynamic typing. The goal of this paper is to aid in the understanding of developers’ high-risk practices towards dynamic typing and the early detection of type related bugs. We first formulate the rules of six types of risky dynamic typing related practices (type smells for short) in Python. We then develop a rule-based tool named RUPOR which builds an accurate type base to detect type smells. Our evaluation shows that RUPOR outperforms the existing type smell detection techniques (including the LLM-based approaches, Mypy, and PYDYPE) on a benchmark of 900 Python methods. Based on RUPOR, we conduct an empirical study on 25 real-world projects. We find that type smells are significantly related to the occurrence of post-release faults. The fault-proneness prediction model built with type smell features slightly outperforms the model built without them. We also summarize the common patterns including inserting type check to fix type smell bugs. These findings provide valuable insights for preventing and fixing type related bugs in the programs written in dynamic-typed languages.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"2014 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139968714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Requirement Engineering Methods for Virtual Reality Software Product Development - A Mapping Study
Sai Anirudh Karre, Y. Raghu Reddy, Raghav Mittal
Software practitioners use various methods in Requirements Engineering (RE) to elicit, analyze, and specify the requirements of enterprise products. These methods affect the final product's characteristics and influence product delivery. Ad-hoc use of the methods by software practitioners can lead to inconsistency and ambiguity in the product. With the notable rise of enterprise products, games, and other applications across various domains, Virtual Reality (VR) has become an essential technology for the future. The methods adopted for requirements engineering when developing VR products require detailed study. This paper presents a mapping study on the requirements engineering methods prescribed and used for developing VR applications, covering requirements elicitation, requirements analysis, and requirements specification. Our study provides insights into the use of such methods in the VR community and suggests specific requirements engineering methods for various fields of interest. We also discuss future directions in requirements engineering for VR products.
{"title":"Requirement Engineering Methods for Virtual Reality Software Product Development - A Mapping Study","authors":"Sai Anirudh Karre, Y. Raghu Reddy, Raghav Mittal","doi":"10.1145/3649595","DOIUrl":"https://doi.org/10.1145/3649595","url":null,"abstract":"<p>Software practitioners use various methods in Requirements Engineering (RE) to elicit, analyze and specify the requirements of a enterprise products. The methods impact the final product characteristics and influence product delivery. Ad-hoc usage of the methods by software practitioners can lead to inconsistency and ambiguity in the product. With the notable rise in enterprise products, games, etc. across various domains, Virtual Reality (VR) has become an essential technology for the future. The methods adopted for requirement engineering for developing VR products requires a detailed study. This paper presents a mapping study on requirement engineering methods prescribed and used for developing VR applications including requirements elicitation, requirements analysis, and requirements specification. Our study provides insights into the use of such methods in the VR community and suggests using specific requirement engineering methods in various fields of interest. We also discuss future directions in requirement engineering for VR products.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"111 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139968466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-Autoregressive Line-Level Code Completion
Fang Liu, Zhiyi Fu, Ge Li, Zhi Jin, Hui Liu, Yiyang Hao, Li Zhang
Software developers frequently use code completion tools to accelerate software development by suggesting the next code elements. Researchers usually employ AutoRegressive (AR) decoders to complete code sequences in a left-to-right, token-by-token fashion. To improve the accuracy and efficiency of code completion, we argue that tokens within a code statement have the potential to be predicted concurrently. In this paper, we first conduct an empirical study to analyze the dependencies among the target tokens in line-level code completion. The results suggest that it is potentially practical to generate all statement tokens in parallel. To this end, we introduce SANAR, a simple and effective syntax-aware non-autoregressive model for line-level code completion. To further improve the quality of the generated code, we propose an adaptive and syntax-aware sampling strategy to boost the model’s performance. Experimental results on two widely used datasets indicate that our model outperforms state-of-the-art code completion approaches of similar model size by a considerable margin, and is faster than these models, with up to a 9× speed-up. Moreover, extensive results demonstrate that the enhancements achieved by SANAR become even more pronounced at larger model sizes, highlighting their significance.
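To see what the parallelism buys, the toy sketch below contrasts greedy autoregressive and non-autoregressive decoding over the same stand-in logits; it illustrates only the decoding schedule, not SANAR's architecture (in a real AR model the logits at each step depend on the tokens already emitted, which is exactly the inter-token dependency the empirical study examines):

```python
# Toy contrast between autoregressive (token-by-token) and non-autoregressive
# (all positions at once) greedy decoding, with random logits in place of a
# real model. In a true AR decoder, logits[t] would be recomputed after each
# emitted token; here they are fixed, so only the step count differs.
import numpy as np

rng = np.random.default_rng(0)
vocab, length = 50, 8
logits = rng.normal(size=(length, vocab))  # stand-in for model outputs

# AR decoding: one "step" per position.
ar_tokens = [int(np.argmax(logits[t])) for t in range(length)]

# NAR decoding: every position predicted in a single step.
nar_tokens = np.argmax(logits, axis=-1).tolist()

assert ar_tokens == nar_tokens  # same greedy choice, length-many fewer steps
print(nar_tokens)
```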
{"title":"Non-Autoregressive Line-Level Code Completion","authors":"Fang Liu, Zhiyi Fu, Ge Li, Zhi Jin, Hui Liu, Yiyang Hao, Li Zhang","doi":"10.1145/3649594","DOIUrl":"https://doi.org/10.1145/3649594","url":null,"abstract":"<p>Software developers frequently use code completion tools to accelerate software development by suggesting the following code elements. Researchers usually employ AutoRegressive (AR) decoders to complete code sequences in a left-to-right, token-by-token fashion. To improve the accuracy and efficiency of code completion, we argue that tokens within a code statement have the potential to be predicted concurrently. In this paper, we first conduct an empirical study to analyze the dependency among the target tokens in line-level code completion. The results suggest that it is potentially practical to generate all statement tokens in parallel. To this end, we introduce SANAR, a simple and effective syntax-aware non-autoregressive model for line-level code completion. To further improve the quality of the generated code, we propose an adaptive and syntax-aware sampling strategy to boost the model’s performance. The experimental results obtained from two widely used datasets indicate that our model outperforms state-of-the-art code completion approaches of similar model size by a considerable margin, and is faster than these models with up to 9 × speed-up. Moreover, the extensive results additionally demonstrate that the enhancements achieved by SANAR become even more pronounced with larger model sizes, highlighting their significance.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"14 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139977717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enumerating Valid Non-Alpha-Equivalent Programs for Interpreter Testing
Xinmeng Xia, Yang Feng, Qingkai Shi, James A. Jones, Xiangyu Zhang, Baowen Xu
Skeletal program enumeration (SPE) can generate a great number of test programs for validating the correctness of compilers or interpreters. Classic SPE generates programs by exhaustively enumerating all possible variable usage patterns within a given syntactic structure. Although it is capable of producing many test programs, the exhaustive enumeration strategy generates a large number of invalid programs, which can waste substantial testing time and resources. To address this problem, this paper proposes a tree-based SPE technique. Compared to the state of the art, the key merit of the tree-based approach is that it takes dependency information into consideration when producing test programs, making it possible to (1) directly generate non-equivalent programs and (2) apply dominance relations to eliminate invalid test programs that use undefined variables. Hence, our approach significantly reduces the cost of the naïve SPE approach. We have implemented our approach in an automated testing tool, IFuzzer, and applied it to test eight different implementations of Python interpreters: CPython, PyPy, IronPython, Jython, RustPython, GPython, Pyston, and Codon. In three months of fuzzing, IFuzzer detected 142 bugs, of which 87 have been confirmed as previously unknown bugs and 34 have been fixed. Compared to state-of-the-art SPE techniques, IFuzzer takes only 61.0% of the time given the same number of testing seeds and improves source code function coverage by 5.3% within the same testing time budget.
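As a toy illustration of the idea (not IFuzzer itself), the sketch below enumerates variable fillings for a fixed three-line skeleton and discards fillings that read a variable before defining it, the kind of invalid program a dominance-aware enumeration avoids generating in the first place:

```python
# Toy skeletal program enumeration: fill the holes of a fixed skeleton with
# variable names, discarding fillings that read a variable before it is
# defined. Illustrative of the def-before-use validity constraint only.
from itertools import product

skeleton = ["{} = 1", "{} = {} + 1", "print({})"]
holes_per_line = [1, 2, 1]
variables = ["a", "b"]

def valid(filling):
    defined = set()
    i = 0
    for line, n in zip(skeleton, holes_per_line):
        names = filling[i:i + n]
        i += n
        writes, reads = names[:1], names[1:]   # first hole of a line is a def
        if line.startswith("print"):
            writes, reads = [], names          # print only reads
        if any(r not in defined for r in reads):
            return False                       # use before def: invalid
        defined.update(writes)
    return True

total_holes = sum(holes_per_line)
programs = [f for f in product(variables, repeat=total_holes) if valid(f)]
print(len(programs), "valid programs out of", len(variables) ** total_holes)
```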
{"title":"Enumerating Valid Non-Alpha-Equivalent Programs for Interpreter Testing","authors":"Xinmeng Xia, Yang Feng, Qingkai Shi, James A. Jones, Xiangyu Zhang, Baowen Xu","doi":"10.1145/3647994","DOIUrl":"https://doi.org/10.1145/3647994","url":null,"abstract":"<p>Skeletal program enumeration (SPE) can generate a great number of test programs for validating the correctness of compilers or interpreters. The classic SPE generates programs by exhaustively enumerating all possible variable usage patterns into a given syntactic structure. Even though it is capable of producing many test programs, the exhaustive enumeration strategy generates a large number of invalid programs, which may waste plenty of testing time and resources. To address the problem, this paper proposes a tree-based SPE technique. Compared to the state-of-the-art, the key merit of the tree-based approach is that it allows us to take the dependency information into consideration when producing test programs and, thus, make it possible to (1) directly generate non-equivalent programs, and (2) apply dominance relations to eliminate invalid test programs that have undefined variables. Hence, our approach significantly saves the cost of the naïve SPE approach. We have implemented our approach into an automated testing tool, <span>IFuzzer</span>, and applied it to test eight different implementations of Python interpreters, including CPython, PyPy, IronPython, Jython, RustPython, GPython, Pyston, and Codon. In three months of fuzzing, <span>IFuzzer</span> detected 142 bugs, of which 87 have been confirmed to be previously unknown bugs, of which 34 have been fixed. Compared to the state-of-the-art SPE techniques, <span>IFuzzer</span> takes only 61.0% of the time cost given the same number of testing seeds and improves 5.3% source code function coverage in the same time budget of testing.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"31 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
sGuard+: Machine Learning Guided Rule-based Automated Vulnerability Repair on Smart Contracts
Cuifeng Gao, Wenzhang Yang, Jiaming Ye, Yinxing Xue, Jun Sun
Smart contracts are becoming appealing targets for hackers because of the vast amount of cryptocurrency under their control. Asset loss due to the exploitation of smart contract code has increased significantly in recent years. To guarantee that smart contracts are vulnerability-free, many works detect vulnerabilities in smart contracts, but only a few works repair them. Repairing smart contract vulnerabilities at the source code level is attractive because it is transparent to users, whereas existing repair tools, such as SCRepair and sGuard, suffer from several limitations: (1) they ignore existing vulnerability-prevention code; (2) they may apply a repair to the wrong statements and change the original business logic of the contract; and (3) they show poor performance in terms of time and gas overhead.
In this work, we propose machine learning guided, rule-based automated vulnerability repair for smart contracts to improve the effectiveness and efficiency of sGuard. To address the limitations mentioned above, we design features that characterize both the symptoms of vulnerabilities and the methods of vulnerability prevention, in order to learn various vulnerability patterns and reduce false positives. Additionally, we design a fine-grained localization algorithm that traverses the nodes of the abstract syntax tree, and we refine and extend the repair rules of sGuard to preserve the original business logic of smart contracts and to support new vulnerability types. Our tool, named sGuard+, reduces time overhead through its machine learning models and reduces gas overhead through fewer code changes and precise patching.
In our experiments, we collect a publicly available vulnerability dataset from CVE, SWC, and SmartBugs Curated as the ground truth for evaluation. Overall, sGuard+ repairs more vulnerabilities with less time and gas overhead than state-of-the-art tools. Furthermore, we reproduce about 9,000 historical transactions for regression testing, showing that sGuard+ has no impact on the original business logic of smart contracts.
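sGuard+ operates on Solidity, but the flavor of fine-grained localization by AST traversal can be shown language-neutrally. The sketch below walks a Python AST to pinpoint each call to a hypothetical low-level transfer function, the statement-level granularity at which a repair rule would patch; it is an illustration of the traversal idea only, not sGuard+'s code:

```python
# Illustrative AST-traversal localization: report the exact source location
# of each call to a (hypothetical) low-level transfer, i.e., the statements
# a repair rule would patch. Not sGuard+'s actual Solidity analysis.
import ast

source = """
def withdraw(amount):
    balance = get_balance()
    send(amount)          # would be flagged: state updated after the call
    set_balance(balance - amount)
"""

class CallLocator(ast.NodeVisitor):
    def __init__(self, name):
        self.name, self.sites = name, []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == self.name:
            self.sites.append((node.lineno, node.col_offset))
        self.generic_visit(node)

locator = CallLocator("send")
locator.visit(ast.parse(source))
print(locator.sites)  # [(4, 4)] -> the precise statement to patch
```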
{"title":"sGuard+: Machine Learning Guided Rule-based Automated Vulnerability Repair on Smart Contracts.","authors":"Cuifeng Gao, Wenzhang Yang, Jiaming Ye, Yinxing Xue, Jun Sun","doi":"10.1145/3641846","DOIUrl":"https://doi.org/10.1145/3641846","url":null,"abstract":"<p>Smart contracts are becoming appealing targets for hackers because of the vast amount of cryptocurrencies under their control. Asset loss due to the exploitation of smart contract codes has increased significantly in recent years. To guarantee that smart contracts are vulnerability-free, there are many works to detect the vulnerabilities of smart contracts, but only a few vulnerability repair works have been proposed. Repairing smart contract vulnerabilities at the source code level is attractive as it is transparent to users, whereas existing repair tools, such as <span>SCRepair</span> and <span>sGuard</span>, suffer from many limitations: (1) ignoring the code of vulnerability prevention; (2) possibly applying the repair to the wrong statements and changing the original business logic of smart contracts; (3) showing poor performance in terms of time and gas overhead. </p><p>In this work, we propose machine learning guided rule-based automated vulnerability repair on smart contracts to improve the effectiveness and efficiency of <span>sGuard</span>. To address the limitations mentioned above, we design the features that characterize both the symptoms of vulnerabilities and the methods of vulnerability prevention to learn various vulnerability patterns and reduce false positives. Additionally, a fine-grained localization algorithm is designed by traversing the nodes of the abstract syntax tree, and we refine and extend the repair rules of <span>sGuard</span> to preserve the original business logic of smart contracts and support new vulnerability types. Our tool, named <span>sGuard+</span>, reduces time overhead based on machine learning models, and reduces gas overhead by fewer code changes and precise patching. </p><p>In our experiment, we collect a publicly available vulnerability dataset from CVE, SWC and SmartBugs Curated as a ground truth for evaluations. Overall, <span>sGuard+</span> repairs more vulnerabilities with less time and gas overhead than state-of-the-art tools. Furthermore, we reproduce about 9,000 historical transactions for regression testing. It is shown that <span>sGuard+</span> has no impact on the original business logic of smart contracts.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"67 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches
Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, Lionel Briand
The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work.
In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.
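A minimal sketch of the shape of that winning pipeline (deep features, then UMAP, then DBSCAN), using synthetic vectors in place of transfer-learning features of failing images and assuming the third-party umap-learn package; all parameter values are illustrative:

```python
# Minimal sketch of the best-performing pipeline shape: feature vectors ->
# UMAP dimensionality reduction -> DBSCAN clustering. Synthetic features
# stand in for transfer-learning features; parameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
# Pretend features for failing images from three failure scenarios.
features = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 64))
                      for c in (0.0, 2.0, 4.0)])

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(embedding)

print("clusters found:", len(set(labels) - {-1}))  # ideally one per root cause
```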
{"title":"Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches","authors":"Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, Lionel Briand","doi":"10.1145/3643671","DOIUrl":"https://doi.org/10.1145/3643671","url":null,"abstract":"<p>The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. </p><p>In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"17 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Try with Simpler – An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection
Lin Yang, Junjie Chen, Shutao Gao, Zhihao Gong, Hongyu Zhang, Yue Kang, Huaan Li
With the rapid development of deep learning (DL), the recent trend in log-based anomaly detection focuses on extracting semantic information from log events (i.e., templates of log messages) and designing more advanced DL models for anomaly detection. These DL-based techniques can indeed improve the effectiveness of log-based anomaly detection, but they suffer from a heavier dependency on training data (such as data quality or data labels) and higher time and resource costs due to the complexity and scale of DL models, which hinder their practical use. Conversely, techniques based on traditional machine learning or data mining algorithms are less dependent on training data and more efficient, but are less effective than DL-based techniques, mainly because of unseen log events (log events in incoming log messages that do not appear in the training data), as confirmed by our motivating study. Intuitively, if we can improve the effectiveness of traditional techniques to be comparable with that of advanced DL-based techniques, log-based anomaly detection becomes more practical. Indeed, an existing study in another area (linking questions posted on Stack Overflow) has pointed out that traditional techniques with some optimizations can achieve effectiveness comparable to the state-of-the-art DL-based technique, indicating the feasibility of enhancing traditional log-based anomaly detection techniques to some degree.
Inspired by the idea of “try-with-simpler”, we conducted the first empirical study to explore the potential of improving traditional techniques for more practical log-based anomaly detection. In this work, we optimized the traditional unsupervised PCA (Principal Component Analysis) technique by incorporating a lightweight semantic-based log representation into it, called SemPCA, and conducted an extensive study to investigate its potential for more practical log-based anomaly detection. Comparing seven log-based anomaly detection techniques (four DL-based techniques, two traditional techniques, and SemPCA) on both public and industrial datasets, our results show that SemPCA achieves effectiveness comparable to advanced supervised/semi-supervised DL-based techniques while being much more stable under insufficient training data and more efficient, demonstrating that a traditional technique can still excel after a small but useful adaptation.
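For intuition, here is a minimal sketch of the underlying PCA idea: fit PCA on event-count vectors of normal sessions, then flag sessions whose reconstruction error (distance to the normal subspace) is large. SemPCA's lightweight semantic representation is not reproduced; the data, component count, and threshold below are all illustrative:

```python
# Minimal sketch of PCA-based log anomaly detection on event-count vectors.
# SemPCA additionally replaces raw counts with a semantic representation,
# which is not shown here.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.poisson(5, size=(200, 30)).astype(float)   # sessions x event types
anomaly = normal[:5].copy()
anomaly[:, 0] += 40                                     # a burst of one event

pca = PCA(n_components=5).fit(normal)

def score(x):
    recon = pca.inverse_transform(pca.transform(x))
    return ((x - recon) ** 2).sum(axis=1)               # squared residual

threshold = np.percentile(score(normal), 99)
print((score(anomaly) > threshold).sum(), "of 5 anomalies flagged")
```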
{"title":"Try with Simpler – An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection","authors":"Lin Yang, Junjie Chen, Shutao Gao, Zhihao Gong, Hongyu Zhang, Yue Kang, Huaan Li","doi":"10.1145/3644386","DOIUrl":"https://doi.org/10.1145/3644386","url":null,"abstract":"<p>With the rapid development of deep learning (DL), the recent trend of log-based anomaly detection focuses on extracting semantic information from log events (i.e., templates of log messages) and designing more advanced DL models for anomaly detection. Indeed, the effectiveness of log-based anomaly detection can be improved, but these DL-based techniques further suffer from the limitations of more heavy dependency on training data (such as data quality or data labels) and higher costs in time and resources due to the complexity and scale of DL models, which hinder their practical use. On the contrary, the techniques based on traditional machine learning or data mining algorithms are less dependent on training data and more efficient, but produce worse effectiveness than DL-based techniques which is mainly caused by the problem of unseen log events (some log events in incoming log messages are unseen in training data) confirmed by our motivating study. Intuitively, if we can improve the effectiveness of traditional techniques to be comparable with advanced DL-based techniques, log-based anomaly detection can be more practical. Indeed, an existing study in the other area (i.e., linking questions posted on Stack Overflow) has pointed out that traditional techniques with some optimizations can indeed achieve comparable effectiveness with the state-of-the-art DL-based technique, indicating the feasibility of enhancing traditional log-based anomaly detection techniques to some degree. </p><p>Inspired by the idea of “try-with-simpler”, we conducted the first empirical study to explore the potential of improving traditional techniques for more practical log-based anomaly detection. In this work, we optimized the traditional unsupervised PCA (Principal Component Analysis) technique by incorporating a lightweight semantic-based log representation in it, called <i>SemPCA</i>, and conducted an extensive study to investigate the potential of <i>SemPCA</i> for more practical log-based anomaly detection. By comparing seven log-based anomaly detection techniques (including four DL-based techniques, two traditional techniques, and <i>SemPCA</i>) on both public and industrial datasets, our results show that <i>SemPCA</i> achieves comparable effectiveness as advanced supervised/semi-supervised DL-based techniques while being much more stable under insufficient training data and more efficient, demonstrating that the traditional technique can still excel after small but useful adaptation.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"44 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks
Zohreh Aghababaeyan, Manel Abdellatif, Mahboubeh Dadkhah, Lionel Briand
Deep neural networks (DNNs) are widely used in various application domains such as image processing, speech recognition, and natural language processing. However, testing DNN models may be challenging due to the complexity and size of their input domain. In particular, testing DNN models often requires generating or exploring large unlabeled datasets, and DNN test oracles, which identify the correct outputs for given inputs, often require expensive manual effort to label test data, possibly involving multiple experts to ensure labeling correctness. In this paper, we propose DeepGD, a black-box multi-objective test selection approach for DNN models. It reduces the cost of labeling by prioritizing the selection of test inputs with high fault-revealing power from large unlabeled datasets. DeepGD not only selects test inputs with high uncertainty scores to trigger as many mispredicted inputs as possible, but also maximizes the probability of revealing distinct faults in the DNN model by selecting diverse mispredicted inputs. Experimental results on four widely used datasets and five DNN models show that, in terms of fault-revealing ability: (1) white-box, coverage-based approaches fare poorly; (2) DeepGD outperforms existing black-box test selection approaches in fault detection; and (3) DeepGD also provides better guidance for retraining DNN models when the selected inputs are used to augment the training set.
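The sketch below is a greedy, illustrative stand-in for the two objectives, picking inputs that combine high uncertainty (here a Gini score over softmax outputs) with distance from already selected inputs; DeepGD itself uses a multi-objective search rather than this greedy blend, and all data and weights here are synthetic:

```python
# Toy uncertainty-plus-diversity test selection: greedily pick inputs with a
# high Gini uncertainty score that are also far from already selected ones.
# Illustrates the two objectives only, not DeepGD's multi-objective search.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=300)     # softmax outputs of a DNN
feats = rng.normal(size=(300, 32))               # input representations

gini = 1.0 - (probs ** 2).sum(axis=1)            # uncertainty per input

selected = [int(np.argmax(gini))]
for _ in range(19):                              # labeling budget of 20
    # Distance of every input to its closest already-selected input.
    d = np.min(np.linalg.norm(feats[:, None] - feats[selected], axis=2), axis=1)
    score = gini + 0.1 * d                       # blend the two objectives
    score[selected] = -np.inf                    # never reselect
    selected.append(int(np.argmax(score)))

print(selected[:5])
```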
{"title":"DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks","authors":"Zohreh Aghababaeyan, Manel Abdellatif, Mahboubeh Dadkhah, Lionel Briand","doi":"10.1145/3644388","DOIUrl":"https://doi.org/10.1145/3644388","url":null,"abstract":"<p>Deep neural networks (DNNs) are widely used in various application domains such as image processing, speech recognition, and natural language processing. However, testing DNN models may be challenging due to the complexity and size of their input domain. Particularly, testing DNN models often requires generating or exploring large unlabeled datasets. In practice, DNN test oracles, which identify the correct outputs for inputs, often require expensive manual effort to label test data, possibly involving multiple experts to ensure labeling correctness. In this paper, we propose <i>DeepGD</i>, a black-box multi-objective test selection approach for DNN models. It reduces the cost of labeling by prioritizing the selection of test inputs with high fault-revealing power from large unlabeled datasets. <i>DeepGD</i> not only selects test inputs with high uncertainty scores to trigger as many mispredicted inputs as possible but also maximizes the probability of revealing distinct faults in the DNN model by selecting diverse mispredicted inputs. The experimental results conducted on four widely used datasets and five DNN models show that in terms of fault-revealing ability: (1) White-box, coverage-based approaches fare poorly, (2) <i>DeepGD</i> outperforms existing black-box test selection approaches in terms of fault detection, and (3) <i>DeepGD</i> also leads to better guidance for DNN model retraining when using selected inputs to augment the training set.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"324 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstraction and Refinement: Towards Scalable and Exact Verification of Neural Networks
Jiaxiang Liu, Yunhan Xing, Xiaomu Shi, Fu Song, Zhiwu Xu, Zhong Ming
As a new programming paradigm, deep neural networks (DNNs) have been increasingly deployed in practice, but their lack of robustness hinders their application in safety-critical domains. While there are techniques for verifying DNNs with formal guarantees, they are limited in scalability and accuracy. In this paper, we present a novel counterexample-guided abstraction refinement (CEGAR) approach for scalable and exact verification of DNNs. Specifically, we propose a novel abstraction that reduces the size of DNNs by over-approximation. The result of verifying the abstract DNN is conclusive if no spurious counterexample is reported. To eliminate each spurious counterexample introduced by abstraction, we propose a novel counterexample-guided refinement that refines the abstract DNN to exclude the spurious counterexample while still over-approximating the original one, yielding a sound, complete, yet efficient CEGAR approach. Our approach is orthogonal to, and can be integrated with, many existing verification techniques. For demonstration, we implement our approach using two promising tools, Marabou and Planet, as the underlying verification engines, and evaluate it on widely used benchmarks for three datasets: ACAS Xu, MNIST, and CIFAR-10. The results show that our approach can boost their performance by solving more problems within the same time limit, reducing verification time by 13.4%-86.3% on average for Marabou on almost all verification tasks and by 8.3%-78.0% on average for Planet on all verification tasks. Compared to the most relevant CEGAR-based approach, our approach is 11.6-26.6 times faster.
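The paper's abstraction merges neurons; as a minimal illustration of the over-approximation principle it relies on, the sketch below propagates an input box through one affine+ReLU layer with interval arithmetic, yielding sound bounds on every concrete output. This is standard interval bound propagation, not the paper's method:

```python
# Sound over-approximation of one affine+ReLU layer via interval arithmetic.
# Any property proved on such bounds also holds for the concrete network;
# spurious violations are what CEGAR's refinement step eliminates.
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)

lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # input box

# Interval arithmetic for W @ x + b: split W by sign of its entries.
Wp, Wn = np.clip(W, 0, None), np.clip(W, None, 0)
out_lo = Wp @ lo + Wn @ hi + b
out_hi = Wp @ hi + Wn @ lo + b

# ReLU is monotone, so bounds pass through directly.
out_lo, out_hi = np.maximum(out_lo, 0), np.maximum(out_hi, 0)
print(out_lo, out_hi)  # sound bounds on every concrete output
```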
{"title":"Abstraction and Refinement: Towards Scalable and Exact Verification of Neural Networks","authors":"Jiaxiang Liu, Yunhan Xing, Xiaomu Shi, Fu Song, Zhiwu Xu, Zhong Ming","doi":"10.1145/3644387","DOIUrl":"https://doi.org/10.1145/3644387","url":null,"abstract":"<p>As a new programming paradigm, deep neural networks (DNNs) have been increasingly deployed in practice, but the lack of robustness hinders their applications in safety-critical domains. While there are techniques for verifying DNNs with formal guarantees, they are limited in scalability and accuracy. In this paper, we present a novel counterexample-guided abstraction refinement (CEGAR) approach for scalable and exact verification of DNNs. Specifically, we propose a novel abstraction to break down the size of DNNs by over-approximation. The result of verifying the abstract DNN is conclusive if no spurious counterexample is reported. To eliminate each spurious counterexample introduced by abstraction, we propose a novel counterexample-guided refinement that refines the abstract DNN to exclude the spurious counterexample while still over-approximating the original one, leading to a sound, complete yet efficient CEGAR approach. Our approach is orthogonal to and can be integrated with many existing verification techniques. For demonstration, we implement our approach using two promising tools <span>Marabou</span> and <span>Planet</span> as the underlying verification engines, and evaluate on widely-used benchmarks for three datasets <monospace>ACAS</monospace> <monospace>Xu</monospace>, <monospace>MNIST</monospace> and <monospace>CIFAR-10</monospace>. The results show that our approach can boost their performance by solving more problems in the same time limit, reducing on average 13.4%–86.3% verification time of <span>Marabou</span> on almost all the verification tasks, and reducing on average 8.3%–78.0% verification time of <span>Planet</span> on all the verification tasks. Compared to the most relevant CEGAR-based approach, our approach is 11.6–26.6 times faster.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"29 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139757624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}