Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

IF: 4.3 | Zone 2 (Agricultural and Forestry Sciences) | JCR Q1 (Soil Science) | Journal: Soil | Pub Date: 2024-02-21 | DOI: 10.5194/egusphere-2024-323

Jeremy Rohmer, Stephane Belbeze, Dominique Guyonnet


Citations: 0



Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

Abstract. Machine learning (ML) models have become key ingredients for digital soil mapping. To improve the interpretability of their prediction, diagnostic tools have been developed like the widely used local attribution approach known as ‘SHAP’ (SHapley Additive exPlanation). However, the analysis of the prediction is only one part of the problem and there is an interest in getting deeper insights into the drivers of the prediction uncertainty as well, i.e. to explain why the ML model is confident, given the set of chosen covariates’ values (in addition to why the ML model delivered some particular results). We show in this study how to apply SHAP to the local prediction uncertainty estimates for a case of urban soil pollution, namely the presence of petroleum hydrocarbon in soil at Toulouse (France), which poses a health risk via vapour intrusion into buildings, direct soil ingestion or groundwater contamination. To alleviate the computational burden posed by the multiple covariates (typically >10) and by the large number of grid points on the map (typically over several 10,000s), we propose to rely on an approach that combines screening analysis (to filter out non-influential covariates) and grouping of dependent covariates by means of generic kernel-based dependence measures. Our results show evidence that the drivers of the prediction best estimate are not necessarily the ones that drive the confidence in these predictions, hence justifying that decisions regarding data collection and covariates’ characterisation as well as communication of the results should be made accordingly.
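The idea of attributing the prediction uncertainty, and not only the prediction itself, can be illustrated with a small sketch. It is not the authors' implementation: it assumes a toy bootstrap ensemble of polynomial regressions as the ML model, takes the ensemble standard deviation as the local uncertainty estimate, and computes exact Shapley values by enumerating coalitions (feasible only for a handful of covariates), with absent covariates replaced by background samples. All names (`exact_shapley`, `predict_mean`, `predict_std`, the synthetic data) are hypothetical.

```python
import itertools
import math

import numpy as np


def exact_shapley(value_fn, x, background):
    """Exact Shapley values of value_fn at point x.

    Covariates outside a coalition keep their background values, and the
    coalition value is the average of value_fn over the background rows.
    """
    d = x.shape[0]

    def coalition_value(subset):
        data = background.copy()
        idx = np.array(subset, dtype=int)
        data[:, idx] = x[idx]  # covariates in the coalition are fixed to x
        return value_fn(data).mean()

    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            # Shapley weight for coalitions of this size
            weight = (math.factorial(size) * math.factorial(d - size - 1)
                      / math.factorial(d))
            for subset in itertools.combinations(others, size):
                gain = coalition_value(subset + (j,)) - coalition_value(subset)
                phi[j] += weight * gain
    return phi


# Synthetic data (assumption): covariate 0 and 1 drive the response,
# covariate 2 drives the noise level.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
noise = rng.normal(scale=0.1 + 0.5 * np.abs(X[:, 2]), size=n)
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + noise


def features(data):
    # Intercept, linear and squared terms
    return np.column_stack([np.ones(len(data)), data, data ** 2])


# Toy ensemble: least-squares fits on bootstrap resamples
coefs = []
for _ in range(50):
    rows = rng.integers(0, n, size=n)
    beta, *_ = np.linalg.lstsq(features(X[rows]), y[rows], rcond=None)
    coefs.append(beta)
coefs = np.array(coefs)


def predict_mean(data):
    # Best estimate: average over ensemble members
    return (features(data) @ coefs.T).mean(axis=1)


def predict_std(data):
    # Local uncertainty estimate: spread over ensemble members
    return (features(data) @ coefs.T).std(axis=1)


x0 = X[0]
background = X[:50]
phi_mean = exact_shapley(predict_mean, x0, background)  # drivers of the prediction
phi_std = exact_shapley(predict_std, x0, background)    # drivers of the uncertainty

print("attribution of best estimate:", phi_mean)
print("attribution of uncertainty:  ", phi_std)
```

Comparing the ranking of covariates in `phi_mean` and `phi_std` is the kind of contrast the abstract refers to. The enumeration grows exponentially with the number of covariates, which is the computational burden that the screening and grouping steps described above are meant to reduce.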


Source journal

Soil (Agricultural and Biological Sciences: Soil Science)

CiteScore: 10.80

Self-citation rate: 2.90%

Number of articles published: not given

Review time: 30 weeks

About the journal: SOIL is an international scientific journal dedicated to the publication and discussion of high-quality research in the field of soil system sciences. SOIL is at the interface between the atmosphere, lithosphere, hydrosphere, and biosphere. SOIL publishes scientific research that contributes to understanding the soil system and its interaction with humans and the entire Earth system. The scope of the journal includes all topics that fall within the study of soil science as a discipline, with an emphasis on studies that integrate soil science with other sciences (hydrology, agronomy, socio-economics, health sciences, atmospheric sciences, etc.).