特征选择策略：基于 SHAP 值和重要性的方法比较分析

IF 8.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Journal of Big Data Pub Date : 2024-03-26 DOI:10.1186/s40537-024-00905-w

Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar

{"title":"特征选择策略：基于 SHAP 值和重要性的方法比较分析","authors":"Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar","doi":"10.1186/s40537-024-00905-w","DOIUrl":null,"url":null,"abstract":"<p>In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.</p>","PeriodicalId":15158,"journal":{"name":"Journal of Big Data","volume":"6 1","pages":""},"PeriodicalIF":8.6000,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods\",\"authors\":\"Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar\",\"doi\":\"10.1186/s40537-024-00905-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.</p>\",\"PeriodicalId\":15158,\"journal\":{\"name\":\"Journal of Big Data\",\"volume\":\"6 1\",\"pages\":\"\"},\"PeriodicalIF\":8.6000,\"publicationDate\":\"2024-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Big Data\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1186/s40537-024-00905-w\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Big Data","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s40537-024-00905-w","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

在高维信用卡欺诈数据方面，研究人员和从业人员通常利用特征选择技术来提高欺诈检测模型的性能。本研究比较了使用 SHAP（SHapley Additive exPlanations）值和模型内置特征重要性列表选择的最重要特征的模型性能。这两种方法都对特征进行排序，并选择最重要的特征进行模型评估。为了评估这些特征选择技术的有效性，我们使用五种分类器建立了分类模型：XGBoost、决策树、CatBoost、极随机树和随机森林。精度-召回曲线下的面积（AUPRC）作为评估指标。所有实验都是在 Kaggle 信用卡欺诈检测数据集上进行的。实验结果和统计测试表明，基于重要性值的特征选择方法优于基于 SHAP 值的分类器和各种特征子集大小的特征选择方法。对于在较大数据集上训练的模型，建议使用模型内置的特征重要性列表作为主要特征选择方法，而不是 SHAP。这一建议的依据是，计算 SHAP 特征重要性是一项独特的工作，而模型在训练过程中自然会提供内置的特征重要性，无需额外工作。因此，对于更大的数据集和更复杂的模型来说，选择模型内置的特征重要性列表可以提供更高效、更实用的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Big Data Computer Science-Information Systems

CiteScore

17.80

自引率

3.70%

发文量

105

审稿时长

13 weeks

期刊介绍： The Journal of Big Data publishes high-quality, scholarly research papers, methodologies, and case studies covering a broad spectrum of topics, from big data analytics to data-intensive computing and all applications of big data research. It addresses challenges facing big data today and in the future, including data capture and storage, search, sharing, analytics, technologies, visualization, architectures, data mining, machine learning, cloud computing, distributed systems, and scalable storage. The journal serves as a seminal source of innovative material for academic researchers and practitioners alike.