{"title":"Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP","authors":"Lukas Schulte, Benjamin Ledel, Steffen Herbold","doi":"10.1007/s10664-024-10469-1","DOIUrl":null,"url":null,"abstract":"<h3 data-test=\"abstract-sub-heading\">Context</h3><p>The identification of bugs within issues reported to an issue tracking system is crucial for triage. Machine learning models have shown promising results for this task. However, we have only limited knowledge of how such models identify bugs. Explainable AI methods like LIME and SHAP can be used to increase this knowledge.</p><h3 data-test=\"abstract-sub-heading\">Objective</h3><p>We want to understand if explainable AI provides explanations that are reasonable to us as humans and align with our assumptions about the model’s decision-making. We also want to know if the quality of predictions is correlated with the quality of explanations.</p><h3 data-test=\"abstract-sub-heading\">Methods</h3><p>We conduct a study where we rate LIME and SHAP explanations based on their quality of explaining the outcome of an issue type prediction model. For this, we rate the quality of the explanations, i.e., if they align with our expectations and help us understand the underlying machine learning model.</p><h3 data-test=\"abstract-sub-heading\">Results</h3><p>We found that both LIME and SHAP give reasonable explanations and that correct predictions are well explained. Further, we found that SHAP outperforms LIME due to a lower ambiguity and a higher contextuality that can be attributed to the ability of the deep SHAP variant to capture sentence fragments.</p><h3 data-test=\"abstract-sub-heading\">Conclusion</h3><p>We conclude that the model finds explainable signals for both bugs and non-bugs. Also, we recommend that research dealing with the quality of explanations for classification tasks reports and investigates rater agreement, since the rating of explanations is highly subjective.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"52 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10469-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Context
The identification of bugs within issues reported to an issue tracking system is crucial for triage. Machine learning models have shown promising results for this task. However, we have only limited knowledge of how such models identify bugs. Explainable AI methods like LIME and SHAP can be used to increase this knowledge.
Objective
We want to understand if explainable AI provides explanations that are reasonable to us as humans and align with our assumptions about the model’s decision-making. We also want to know if the quality of predictions is correlated with the quality of explanations.
Methods
We conduct a study where we rate LIME and SHAP explanations based on their quality of explaining the outcome of an issue type prediction model. For this, we rate the quality of the explanations, i.e., if they align with our expectations and help us understand the underlying machine learning model.
Results
We found that both LIME and SHAP give reasonable explanations and that correct predictions are well explained. Further, we found that SHAP outperforms LIME due to a lower ambiguity and a higher contextuality that can be attributed to the ability of the deep SHAP variant to capture sentence fragments.
Conclusion
We conclude that the model finds explainable signals for both bugs and non-bugs. Also, we recommend that research dealing with the quality of explanations for classification tasks reports and investigates rater agreement, since the rating of explanations is highly subjective.
期刊介绍:
Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories.
The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings.
Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.