Conditional Feature Selection: Evaluating Model Averaging When Selecting Features with Shapley Values

Geomatics · Published: 2024-08-08 · DOI: 10.3390/geomatics4030016
Florian Huber, Volker Steinhage

Abstract

Artificial intelligence (AI), and especially machine learning (ML), is rapidly transforming the field of geomatics with respect to collecting, managing, and analyzing spatial data. Feature selection is a crucial building block in ML because it directly impacts a model's performance and predictive power by retaining the most critical variables and eliminating redundant and irrelevant ones. Random forests have been used for decades and allow for building highly accurate models. However, identifying the most expressive features in a dataset by selecting the most important features within a random forest remains a challenging problem. The widely used internal Gini importances of random forests are based on the number of training examples split by a feature but fail to account for the magnitude of change in the target variable, leading to suboptimal selections. Shapley values are an established and unified framework for feature attribution, i.e., specifying how much each feature in a trained ML model contributes to the predictions for a given instance. Previous studies highlight the effectiveness of Shapley values for feature selection in real-world applications, while other research emphasizes certain theoretical limitations. This study provides an application-driven discussion of Shapley values for feature selection by first proposing four necessary conditions for successful feature selection with Shapley values, distilled from a multitude of critical research in the field. Even when these conditions are met, Shapley value feature selection remains, by definition, a model averaging procedure, in which unimportant features can alter the final selection.
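To make the attribution idea concrete, the following sketch computes exact Shapley values for a toy value function by enumerating all subsets. The value function `v` and the feature names are hypothetical illustrations, not taken from the paper; note that this brute-force enumeration is exponential in the number of players, whereas for tree ensembles polynomial-time algorithms exist, as the abstract notes.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values for a value function v: frozenset -> float.

    Enumerates every subset S not containing player i and averages the
    marginal contribution v(S ∪ {i}) - v(S) with the Shapley weights,
    so the cost grows as 2^n and is feasible only for small n.
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))
        phi[i] = total
    return phi

# Hypothetical additive "model": the prediction depends strongly on
# feature "a", weakly on "b", and not at all on the irrelevant "c".
def v(S):
    return 4.0 * ("a" in S) + 1.0 * ("b" in S)

print(shapley_values(["a", "b", "c"], v))
# φ ≈ {"a": 4.0, "b": 1.0, "c": 0.0} — the irrelevant feature gets zero
```

The efficiency axiom guarantees that the attributions sum to `v` of the full feature set, which is what makes Shapley values a principled ranking criterion for feature selection.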
Therefore, we additionally present Conditional Feature Selection (CFS), a novel feature selection algorithm that mitigates this problem, and use it to evaluate the impact of model averaging on several real-world examples covering the use of ML in geomatics. The results show that Shapley values are a good measure for feature selection when compared with Gini feature importances on four real-world examples, improving RMSE by 5% when averaged over selections of all possible subset sizes. CFS achieves an even better selection, improving on the Gini selection by approximately 7.5% in terms of RMSE. For random forests, Shapley values can be computed in polynomial time, an advantage over the exponential runtime of CFS that trades off against the accuracy lost in feature selection due to model averaging.
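The abstract does not spell out the CFS algorithm, but the exponential runtime it mentions is characteristic of exhaustive subset evaluation. The sketch below is a generic brute-force best-subset search, not the authors' CFS; the feature names and the `score` callback (which in practice would retrain and validate a model on each candidate subset) are hypothetical.

```python
from itertools import combinations

def best_subset(features, score):
    """Score every non-empty feature subset and return the best one.

    Evaluates 2^n - 1 candidate subsets, so the runtime is exponential
    in the number of features — the trade-off the abstract describes
    against polynomial-time Shapley values for random forests.
    """
    best, best_score = None, float("-inf")
    for r in range(1, len(features) + 1):
        for S in combinations(features, r):
            s = score(S)
            if s > best_score:
                best, best_score = set(S), s
    return best, best_score

# Hypothetical score: reward informative features, penalize subset size.
informative = {"elevation": 3.0, "slope": 1.5, "noise": 0.0}
score = lambda S: sum(informative[f] for f in S) - 0.5 * len(S)
print(best_subset(informative, score))
# → ({"elevation", "slope"}, 3.5): the uninformative feature is dropped
```

Because every subset is evaluated with the other features held out, such a search is immune to the model averaging effect that can distort a Shapley-value ranking, at the price of exponential cost.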