Peeler: Learning to Effectively Predict Flakiness without Running Tests
Yihao Qin, Shangwen Wang, Kui Liu, Bo Lin, Hongjun Wu, Li Li, Xiaoguang Mao, Tegawendé F. Bissyandé
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00031
Regression testing is a widely adopted approach to expose change-induced bugs and to verify the correctness/robustness of code in modern software development settings. Unfortunately, the occurrence of flaky tests significantly increases the cost of regression testing and ultimately reduces the productivity of developers (i.e., their ability to find and fix real problems). State-of-the-art approaches leverage dynamic test information, obtained through expensive re-execution of test cases, to effectively identify flaky tests. To address scalability constraints, some recent approaches have built on static test case features, but these fall short on effectiveness. In this paper, we introduce Peeler, a new fully static approach for predicting flaky tests that explores a representation of test cases based on data dependency relations. The predictor is trained as a neural-network-based model that simultaneously achieves scalability (it does not require any test execution), effectiveness (it exploits relevant test dependency features), and practicality (it can be applied in the wild to find new flaky tests). Experimental validation on 17,532 test cases from 21 Java projects shows that Peeler outperforms the state-of-the-art FlakeFlagger by around 20 percentage points: we catch 22% more flaky tests while yielding 51% fewer false positives. Finally, in a live study with projects in the wild, we reported 21 flakiness cases to developers, 12 of which have already been confirmed as indeed flaky.
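
As a rough illustration of what a data-dependency-based representation of a test case could look like, here is a minimal sketch (entirely hypothetical, not Peeler's implementation) that links each variable defined in a simplified Java test body to its later uses:

```python
# Hypothetical sketch: extracting coarse data-dependency triples from the
# statements of a (simplified) Java test method. This is NOT Peeler's code;
# it only illustrates linking a variable's definition site to its later uses.
import re

ASSIGN = re.compile(r"^(?:\w+(?:<[^>]*>)?\s+)?(\w+)\s*=\s*(.+);$")

def data_dependencies(statements):
    """Return (def_line, use_line, variable) triples for a list of statements."""
    defs = {}   # variable name -> line index of its most recent definition
    deps = []
    for i, stmt in enumerate(statements):
        m = ASSIGN.match(stmt.strip())
        rhs = m.group(2) if m else stmt
        # Every previously defined variable mentioned on this line is a use.
        for var, def_line in defs.items():
            if re.search(rf"\b{re.escape(var)}\b", rhs):
                deps.append((def_line, i, var))
        if m:
            defs[m.group(1)] = i  # record (re)definition
    return deps

test_body = [
    'File tmp = File.createTempFile("data", ".txt");',
    "Writer w = new FileWriter(tmp);",
    'w.write("hello");',
    "assertTrue(tmp.exists());",
]
print(data_dependencies(test_body))
# [(0, 1, 'tmp'), (1, 2, 'w'), (0, 3, 'tmp')]
```

Ordered def-use triples like these capture test structure without ever running the test, which is the kind of static signal a learned flakiness predictor could consume.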

Message from the General Co-Chairs and Program Co-Chairs
Pub Date: 2022-10-01 | DOI: 10.1109/icsme55016.2022.00005

LiFUSO: A Tool for Library Feature Unveiling based on Stack Overflow Posts
Camilo Velázquez-Rodríguez, Eleni Constantinou, Coen De Roover
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00065
Selecting a library from a vast ecosystem can be a daunting task. The libraries are not only numerous, but they also lack an enumeration of the features they offer. A feature enumeration for each library in an ecosystem would help developers select the most appropriate library for the task at hand. Within this enumeration, a library feature could take the form of a brief description together with the API references through which the feature can be reused. This paper presents LiFUSO, a tool that leverages Stack Overflow posts to compute a list of such features for a given library. Each feature corresponds to a cluster of related API references based on the similarity of the Stack Overflow posts in which they occur. Once LiFUSO has extracted such a cluster of posts, it applies natural language processing to describe the corresponding feature. We describe the engineering aspects of the tool, and illustrate its usage through a preliminary case study in which we compare the features uncovered for two competing libraries within the same domain. An executable version of the tool is available at https://github.com/softwarelanguageslab/lifuso and its demonstration video is accessible at https://youtu.be/tDE1LWa86cA.
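
As a rough sketch of the clustering idea described above (assumed, not LiFUSO's actual pipeline), API references can be grouped by the overlap of the Stack Overflow posts mentioning them:

```python
# Hypothetical sketch: grouping API references into candidate "features" by
# the similarity of the Stack Overflow posts in which they co-occur. Each
# API reference is represented by the set of post IDs mentioning it, and
# references are greedily merged when their post sets overlap enough.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_api_refs(posts_by_ref, threshold=0.3):
    """posts_by_ref: {api_reference: set(post_ids)} -> list of clusters."""
    clusters = []  # each cluster: (set of refs, union of their post ids)
    for ref, posts in posts_by_ref.items():
        for refs, cposts in clusters:
            if jaccard(posts, cposts) >= threshold:
                refs.add(ref)
                cposts |= posts
                break
        else:
            clusters.append(({ref}, set(posts)))
    return [sorted(refs) for refs, _ in clusters]

# Toy example with invented post IDs:
posts_by_ref = {
    "Files.readAllLines": {1, 2, 3},
    "BufferedReader.readLine": {2, 3, 4},
    "HttpClient.send": {7, 8},
    "HttpRequest.newBuilder": {7, 8, 9},
}
print(cluster_api_refs(posts_by_ref))
# [['BufferedReader.readLine', 'Files.readAllLines'],
#  ['HttpClient.send', 'HttpRequest.newBuilder']]
```

Each resulting cluster would then be summarized with NLP over its posts to produce the feature description.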

The Engineering Implications of Code Maintenance in Practice
N. Lee, R. Abreu, M. Yatbaz, Hang Qu, Nachiappan Nagappan
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00078
Allowing developers to move fast when evolving and maintaining low-latency, large-scale distributed systems is a challenging problem due to i) sheer system complexity and scale, ii) degrading code quality, and iii) the difficulty of performing reliable, rapid change management while the system is in production. Addressing these problems has many benefits for system developer efficiency, reliability, performance, and code maintenance. In this paper, we present a real-world case study of an architectural refactoring project in an industrial setting. The system in scope, codenamed the ItemIndexer delivery system (I2DS), is responsible for processing and delivering a large number of items rapidly to billions of users in real time. I2DS is running in production; it was refactored live over a period of 9 months and assessed through impact validation studies that show a 42% improvement in developer efficiency, an 87% improvement in reliability, a 20% increase in item scoring, a 10% increase in item matching, and 14% CPU savings.

"When the Code becomes a Crime Scene": Towards Dark Web Threat Intelligence with Software Quality Metrics
Giuseppe Cascavilla, Gemma Catolino, Felipe Ebert, D. Tamburri, Willem-Jan van den Heuvel
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00055
The increasing growth of illegal online activities in the so-called dark web (that is, the hidden collective of internet sites only accessible through specialized web browsers) has challenged law enforcement agencies in recent years, with only sparse research efforts to help. For example, research has been devoted to supporting law enforcement by employing Natural Language Processing (NLP) to detect illegal activities on the dark web and build models for their classification. However, current approaches rely strongly upon the linguistic characteristics used to train the models, e.g., language semantics, which threatens their generalizability. To overcome this limitation, we tackle the problem of predicting illegal and criminal activities on the dark web (a process known as threat intelligence) from a complementary perspective, that of dark web code maintenance and evolution, and propose a novel approach that uses software quality metrics and dark website appearance parameters instead of linguistic characteristics. We performed a preliminary empirical study on 10,367 web pages and collected more than 40 code metrics and website parameters using SonarQube. Results show an accuracy of up to 82% for predicting the three activity categories (i.e., suspicious, normal, and unknown) and 66% for detecting 26 specific illegal activities, such as drugs or weapons trafficking. We believe our results can influence current trends in detecting illegal activities on the dark web and put forward a completely novel research avenue toward dealing with this problem from a software maintenance and evolution perspective.
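
A minimal sketch of the classification setup the abstract implies, with entirely synthetic features and labels (the real study uses SonarQube code metrics and appearance parameters):

```python
# Hypothetical sketch: training a classifier to categorize dark-web pages
# from software quality metrics and appearance parameters rather than from
# linguistic features. Feature columns and labels are invented for
# illustration (e.g., code smells, duplicated lines, number of images).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((500, 5))              # 5 synthetic metric columns
y = rng.integers(0, 3, 500)           # 0=normal, 1=suspicious, 2=unknown

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```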

Deceiving Deep Neural Networks-Based Binary Code Matching with Adversarial Programs
W. Wong, Huaijin Wang, Pingchuan Ma, Shuai Wang, Mingyue Jiang, T. Chen, Qiyi Tang, Sen Nie, Shi Wu
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00019
Deep neural networks (DNNs) have achieved major success in solving challenging tasks such as social network analysis and image classification. Despite the prosperous development of DNNs, recent research has demonstrated the feasibility of exploiting DNNs with adversarial examples, in which a small distortion is added to the input data to largely mislead the predictions of DNNs. Determining the similarity of two binary code fragments is the foundation of many reverse engineering, re-engineering, and security applications. Currently, the majority of binary code matching tools are based on DNNs, whose dependability has not been thoroughly studied. In this research, we present an attack that perturbs software in executable format to deceive DNN-based binary code matching. Unlike prior attacks, which mostly change non-functional code components to generate adversarial programs, our approach proposes several semantics-preserving transformations that operate directly on the control flow graph of binary code, making it particularly effective at deceiving DNNs. To speed up the process, we design a framework that leverages gradient- or hill-climbing-based optimizations to generate adversarial examples in both white-box and black-box settings. We evaluated our attack against two popular DNN-based binary code matching tools, asm2vec and ncc, and achieve reasonably high success rates. Our attack against an industrial-strength DNN-based binary code matching service, BinaryAI, shows that the proposed attack can fool remote APIs in challenging black-box settings with a success rate of over 16.2% (on average). Furthermore, we show that the generated adversarial programs can be used to augment the robustness of two white-box models, asm2vec and ncc, reducing the attack success rates by 17.3% and 6.8% while preserving stable, if not better, standard accuracy.
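
In the black-box setting, a hill-climbing search of the kind mentioned above could look like the following sketch, where `apply` and `similarity` are hypothetical stand-ins for a semantics-preserving transformation engine and the target matching model:

```python
# Minimal hill-climbing sketch in the spirit of the black-box setting the
# paper describes. Everything here is hypothetical: `apply(binary, t)` would
# re-apply a semantics-preserving CFG transformation, and `similarity` would
# query the target binary-code-matching model.
import random

def hill_climb_attack(binary, target, transforms, apply, similarity,
                      budget=200, goal=0.5):
    """Greedily accumulate transformations that lower the matching score."""
    best, best_score = binary, similarity(binary, target)
    for _ in range(budget):
        t = random.choice(transforms)
        candidate = apply(best, t)      # semantics preserved by design
        score = similarity(candidate, target)
        if score < best_score:          # keep only improving moves
            best, best_score = candidate, score
        if best_score < goal:           # matching model now fooled
            break
    return best, best_score
```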

Integrating Software Issue Tracking and Traceability Models
Naveen Ganesh Muralidharan, Vera Pantelic, V. Bandur, R. Paige
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00053
Awareness of the importance of systems and software traceability, as well as tool support for it, has improved over the years. However, an effective solution for traceability must align and integrate with an organization’s engineering processes. Specifically, the phases of the traceability process model (traceability strategy, creation, use, and maintenance) must be aligned with the organization’s engineering processes. Previous research has discussed the benefits of integrating traceability into the configuration management process. In this paper, we propose Change Request management based on traceability data. In our approach, new Change Requests (CRs) are created from the traceability model of the corresponding project, and each created CR contains the portion of the project’s overall traceability model that is relevant to that change. A proof-of-concept issue tracking system that uses a traceability model at its core is proposed.
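
A minimal sketch of the core idea, with an invented artifact naming scheme: when a CR is opened for an artifact, attach the slice of the traceability model reachable from it:

```python
# Hypothetical sketch: extracting the portion of a traceability model that
# is relevant to a Change Request. Trace links and artifact names are
# illustrative, not the paper's actual model.
from collections import deque

def trace_slice(trace_links, changed):
    """BFS over trace links; returns the links relevant to `changed`."""
    neighbors = {}
    for src, dst in trace_links:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)  # treat links as bidirectional
    seen, queue = {changed}, deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [(s, d) for s, d in trace_links if s in seen and d in seen]

links = [("REQ-1", "DES-3"), ("DES-3", "SRC-main.c"), ("REQ-2", "DES-4")]
print(trace_slice(links, "SRC-main.c"))
# [('REQ-1', 'DES-3'), ('DES-3', 'SRC-main.c')]
```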

An Effective Approach for Parsing Large Log Files
Issam Sedki, A. Hamou-Lhadj, O. Mohamed, M. Shehab
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00009
Because of their contribution to the overall reliability assurance process, software logs have become important data assets for the analysis of software systems. Logs are often the only data points that can shed light on how a software system behaves once deployed. Unfortunately, logs are often unstructured, hindering viable analysis of their content. Several studies aim to automatically parse large log files; the primary goal is to create templates from raw log data samples that can later be used to recognize future logs. In this paper, we propose ULP, a Unified Log Parsing tool that is highly accurate and efficient. ULP combines string matching and local frequency analysis to parse large log files efficiently. First, log events are organized into groups using a text processing method. Frequency analysis is then applied locally to instances of the same group to identify the static and dynamic content of log events. When applied to 10 log datasets of the LogPai benchmark, ULP achieves an average accuracy of 89.2%, outperforming four leading log parsing tools: Drain, Logram, SPELL, and AEL. Additionally, ULP can parse up to four million log events in less than 3 minutes. ULP is available online as open source and can readily be used by practitioners and researchers to parse large log files effectively and efficiently in support of log analysis tasks.
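
A toy illustration of local frequency analysis as the abstract describes it (not ULP's actual code): within a group of similar log events, tokens that are constant across the group become template text, while varying tokens become parameters:

```python
# Sketch of local frequency analysis over one group of similar log events:
# a token occurring in every message at a given position is treated as
# static template text; anything else is abstracted as a parameter.
from collections import Counter

def template_from_group(messages):
    token_rows = [m.split() for m in messages]
    template = []
    for position in zip(*token_rows):          # column-wise over tokens
        counts = Counter(position)
        token, freq = counts.most_common(1)[0]
        template.append(token if freq == len(messages) else "<*>")
    return " ".join(template)

group = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Connection from 172.16.3.2 closed after 455 ms",
]
print(template_from_group(group))
# Connection from <*> closed after <*> ms
```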

Why Don’t XAI Techniques Agree? Characterizing the Disagreements Between Post-hoc Explanations of Defect Predictions
Saumendu Roy, Gabriel Laberge, Banani Roy, Foutse Khomh, Amin Nikanjam, Saikat Mondal
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00056
Machine Learning (ML) based defect prediction models can be used to improve the reliability and overall quality of software systems. However, such defect predictors might not be deployed in real applications due to their lack of transparency. Thus, several post-hoc explanation methods (e.g., LIME and SHAP) have recently gained popularity. These explanation methods can offer insight by ranking features based on their importance in black-box decisions. The explainability of ML techniques is fairly novel in the Software Engineering community, and it is still unclear whether such explainability methods genuinely help practitioners make better decisions regarding software maintenance. Recent user studies show that data scientists usually utilize multiple post-hoc explainers to understand a single model decision because of the lack of ground truth. Such a scenario causes disagreement between explainability methods and impedes drawing conclusions. Therefore, our study first investigates three disagreement metrics between the LIME and SHAP explanations of 10 defect predictors and shows that disagreements regarding the rankings of feature importance are the most frequent. Our findings lead us to propose a method for aggregating LIME and SHAP explanations that puts less emphasis on these disagreements while highlighting the aspects on which the explanations agree.
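
One common disagreement metric in this literature is top-k feature agreement; the sketch below (with invented importance scores, since the abstract does not specify the paper's exact three metrics) shows how two explanations of the same prediction can be compared:

```python
# Hypothetical sketch of one disagreement metric: top-k feature agreement
# between two post-hoc explanations of the same defect prediction.
def top_k_agreement(expl_a, expl_b, k=3):
    """Fraction of overlap between the k most important features."""
    top = lambda e: {f for f, _ in
                     sorted(e.items(), key=lambda x: -abs(x[1]))[:k]}
    return len(top(expl_a) & top(expl_b)) / k

# Invented importance scores for the same prediction:
lime_scores = {"loc": 0.41, "churn": 0.33, "cyclomatic": 0.12, "authors": 0.05}
shap_scores = {"churn": 0.38, "authors": 0.29, "loc": 0.21, "cyclomatic": 0.02}
print(top_k_agreement(lime_scores, shap_scores))  # 0.666... -> partial agreement
```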

RepoQuester: A Tool Towards Evaluating GitHub Projects
Kowndinya Boyalakuntla, M. Nagappan, S. Chimalakonda, Nuthan Munaiah
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00069
Given the drastic rise in the number of repositories on GitHub, it is often hard for developers to find relevant projects that meet their requirements, as analyzing source code and other artifacts is effort-intensive. In our prior work, we proposed Repo Reaper (or simply Reaper), which assesses GitHub projects based on seven metrics spanning project collaboration, quality, and maintenance. By classifying projects into ‘engineered’ and ‘non-engineered’ software projects, Reaper identified 1.4 million out of nearly 1.8 million projects as having no purpose for collaboration or software development. While Reaper can be used to assess millions of repositories based on GHTorrent, it depends on GHTorrent and is not designed to be used by developers for standalone repositories on local machines. Hence, in this paper, we propose a re-engineered and extended command-line tool named RepoQuester that aims to assist developers in evaluating GitHub projects on their local machines. RepoQuester computes metrics for projects but does not classify projects into ‘engineered’ and ‘non-engineered’ ones. However, to demonstrate the correctness of the metric scores produced by RepoQuester, we performed the project classification on Reaper’s training and validation datasets after updating them with the latest metric scores (as reported by RepoQuester). These datasets have their ground truth manually established. During the analysis, we observed that the machine learning classifiers built on the updated datasets produced an F1 score of 72%. During the evaluation, we found that RepoQuester can compute a project’s metric scores in less than 10 seconds. A demo video explaining the tool’s highlights and usage is available at https://youtu.be/Q8OdmNzUfN0, and the source code at https://github.com/Kowndinya2000/Repoquester.
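
As an illustration of the kind of metric such a tool computes locally (hypothetical, not RepoQuester's code), commit frequency can be derived from `git log` timestamps:

```python
# Hypothetical sketch of one locally computable repository metric:
# average commits per month over the repository's lifetime.
import subprocess

def commits_per_month(repo_path):
    """Average commits per month, derived from `git log` commit timestamps."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%ct"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    timestamps = sorted(int(t) for t in out)
    if len(timestamps) < 2:
        return float(len(timestamps))
    months = max((timestamps[-1] - timestamps[0]) / (30 * 24 * 3600), 1.0)
    return len(timestamps) / months

# e.g. print(commits_per_month("."))
```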