{"title":"为什么XAI技术不一致?描述缺陷预测的事后解释之间的分歧","authors":"Saumendu Roy, Gabriel Laberge, Banani Roy, Foutse Khomh, Amin Nikanjam, Saikat Mondal","doi":"10.1109/ICSME55016.2022.00056","DOIUrl":null,"url":null,"abstract":"Machine Learning (ML) based defect prediction models can be used to improve the reliability and overall quality of software systems. However, such defect predictors might not be deployed in real applications due to the lack of transparency. Thus, recently, application of several post-hoc explanation methods (e.g., LIME and SHAP) have gained popularity. These explanation methods can offer insight by ranking features based on their importance in black box decisions. The explainability of ML techniques is reasonably novel in the Software Engineering community. However, it is still unclear whether such explainability methods genuinely help practitioners make better decisions regarding software maintenance. Recent user studies show that data scientists usually utilize multiple post-hoc explainers to understand a single model decision because of the lack of ground truth. Such a scenario causes disagreement between explainability methods and impedes drawing a conclusion. Therefore, our study first investigates three disagreement metrics between LIME and SHAP explanations of 10 defect-predictors, and exposes that disagreements regarding the rankings of feature importance are most frequent. Our findings lead us to propose a method of aggregating LIME and SHAP explanations that puts less emphasis on these disagreements while highlighting the aspect on which explanations agree.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"5 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Why Don’t XAI Techniques Agree? Characterizing the Disagreements Between Post-hoc Explanations of Defect Predictions\",\"authors\":\"Saumendu Roy, Gabriel Laberge, Banani Roy, Foutse Khomh, Amin Nikanjam, Saikat Mondal\",\"doi\":\"10.1109/ICSME55016.2022.00056\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine Learning (ML) based defect prediction models can be used to improve the reliability and overall quality of software systems. However, such defect predictors might not be deployed in real applications due to the lack of transparency. Thus, recently, application of several post-hoc explanation methods (e.g., LIME and SHAP) have gained popularity. These explanation methods can offer insight by ranking features based on their importance in black box decisions. The explainability of ML techniques is reasonably novel in the Software Engineering community. However, it is still unclear whether such explainability methods genuinely help practitioners make better decisions regarding software maintenance. Recent user studies show that data scientists usually utilize multiple post-hoc explainers to understand a single model decision because of the lack of ground truth. Such a scenario causes disagreement between explainability methods and impedes drawing a conclusion. Therefore, our study first investigates three disagreement metrics between LIME and SHAP explanations of 10 defect-predictors, and exposes that disagreements regarding the rankings of feature importance are most frequent. 
Our findings lead us to propose a method of aggregating LIME and SHAP explanations that puts less emphasis on these disagreements while highlighting the aspect on which explanations agree.\",\"PeriodicalId\":300084,\"journal\":{\"name\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"volume\":\"5 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSME55016.2022.00056\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSME55016.2022.00056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Why Don’t XAI Techniques Agree? Characterizing the Disagreements Between Post-hoc Explanations of Defect Predictions
Machine Learning (ML)-based defect prediction models can be used to improve the reliability and overall quality of software systems. However, such defect predictors might not be deployed in real applications due to their lack of transparency. Thus, post-hoc explanation methods (e.g., LIME and SHAP) have recently gained popularity. These methods can offer insight by ranking features based on their importance in black-box decisions. The explainability of ML techniques is relatively novel in the Software Engineering community, and it is still unclear whether such explainability methods genuinely help practitioners make better decisions about software maintenance. Recent user studies show that, lacking ground truth, data scientists usually consult multiple post-hoc explainers to understand a single model decision. This practice exposes disagreements between explainability methods and impedes drawing conclusions. Our study therefore first investigates three disagreement metrics between LIME and SHAP explanations of 10 defect predictors, and shows that disagreements over the rankings of feature importance are the most frequent. These findings lead us to propose a method for aggregating LIME and SHAP explanations that puts less emphasis on such disagreements while highlighting the aspects on which the explanations agree.
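
For readers who want to probe such disagreements themselves, the sketch below is a minimal illustration, not the paper's exact setup: it computes LIME and SHAP feature rankings for a single prediction of a fitted defect classifier and measures their top-k feature agreement, one plausible disagreement metric. The model, training data, and the choice of k are placeholder assumptions; the library calls follow the public lime and shap APIs.

    # Minimal sketch: LIME vs. SHAP feature rankings for one prediction,
    # compared with a top-k feature-agreement score. Assumes `model` is a
    # fitted tree ensemble (e.g., a scikit-learn RandomForestClassifier).
    import numpy as np
    import shap
    from lime.lime_tabular import LimeTabularExplainer

    def top_k_agreement(rank_a, rank_b, k=5):
        """Fraction of overlap between the two top-k feature sets."""
        return len(set(rank_a[:k]) & set(rank_b[:k])) / k

    def lime_shap_agreement(model, X_train, feature_names, x, k=5):
        # --- LIME: local surrogate weights for the positive (defective) class ---
        lime_exp = LimeTabularExplainer(
            X_train, feature_names=feature_names, mode="classification"
        ).explain_instance(x, model.predict_proba,
                           num_features=len(feature_names))
        lime_w = np.zeros(len(feature_names))
        for idx, weight in lime_exp.as_map()[1]:  # label 1 = defective
            lime_w[idx] = weight

        # --- SHAP: additive attributions from a tree explainer ---
        sv = shap.TreeExplainer(model).shap_values(x.reshape(1, -1))
        if isinstance(sv, list):        # older shap: one array per class
            shap_w = sv[1][0]
        elif np.ndim(sv) == 3:          # newer shap: (samples, features, classes)
            shap_w = np.asarray(sv)[0, :, 1]
        else:
            shap_w = np.asarray(sv)[0]

        # Rank features by absolute importance and compare the top-k sets.
        lime_rank = np.argsort(-np.abs(lime_w))
        shap_rank = np.argsort(-np.abs(shap_w))
        return top_k_agreement(lime_rank, shap_rank, k)

A score of 1.0 means both explainers select the same top-k features; values near 0 signal the ranking disagreements that the study found to be the most frequent.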
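
The abstract does not detail the proposed aggregation scheme, so the following is only an illustrative sketch of the general idea: combine min-max-normalized absolute importances from both explainers so that features both rate highly dominate the merged ranking, while features flagged by only one method are de-emphasized.

    # Illustrative sketch only -- not the paper's aggregation method.
    # Each explainer's absolute importances are min-max normalized, then
    # averaged, so the combined ranking is driven by features on which
    # LIME and SHAP agree.
    import numpy as np

    def aggregate_explanations(lime_w, shap_w):
        def normalize(w):
            w = np.abs(np.asarray(w, dtype=float))
            rng = w.max() - w.min()
            return (w - w.min()) / rng if rng > 0 else np.zeros_like(w)

        # Equal-weight average; features ranked highly by only one
        # method are pulled toward the middle of the combined ranking.
        return (normalize(lime_w) + normalize(shap_w)) / 2.0

    # Example: feature 0 is top-ranked by both methods, so it stays on top.
    combined = aggregate_explanations([0.9, -0.1, 0.4], [0.8, 0.5, 0.05])
    ranking = np.argsort(-combined)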