Random forest for spatial prediction of censored response variables

Artificial Intelligence in Geosciences, Pub Date: 2021-12-01, DOI: 10.1016/j.aiig.2022.02.001

Francky Fouedjio

Volume 2, Pages 115-127. Publication type: Journal Article.

Citations: 2

Abstract

The spatial prediction of a continuous response variable, when spatially exhaustive predictor variables are available within the region under study, has become ubiquitous in many geoscience fields. The response variable is often subject to detection limits due to limitations of the measuring instrument or the sampling protocol used. Consequently, the response variable's observations are censored (left-censored, right-censored, or interval-censored). Machine learning methods dedicated to the spatial prediction of uncensored response variables cannot explicitly account for censored observations. In such cases, they are routinely applied through ad hoc approaches, such as ignoring the censored observations or replacing them with arbitrary values. The resulting spatial prediction may therefore be inaccurate and sensitive to the assumptions and approximations involved in those arbitrary choices. This paper introduces a random forest-based machine learning method for spatially predicting a censored response variable, in which the censored observations are explicitly taken into account. The basic idea consists of building an ensemble of regression tree predictors by training the classical regression random forest on the subset of data containing only the uncensored observations. Principal component analysis applied to this ensemble then translates the response variable's observations (uncensored and censored) into a system of linear equalities and inequalities. This system is solved through randomized quadratic programming, which yields an ensemble of reconstructed regression tree predictors that exactly honor the observations (uncensored and censored). The response variable's spatial prediction is then obtained by averaging this latter ensemble. The effectiveness of the proposed method is illustrated on simulated data for which ground truth is available and showcased on real-world data, including geochemical data. The results suggest that the proposed technique makes greater use of the response variable's censored observations than ad hoc methods.
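The constraint step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The ensemble is a synthetic stand-in for the per-tree predictions of a regression random forest trained on uncensored observations. The randomization is assumed to be a random draw of principal component coefficients. A simple active-set heuristic (the hypothetical helper `project`) replaces a proper quadratic programming solver: it returns a point that honors the constraints, but not necessarily the exact minimizer. All names and numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D setting (every value here is an illustrative assumption).
x = np.linspace(0.0, 1.0, 101)            # prediction grid
truth = 2.0 + np.sin(2.0 * np.pi * x)     # hidden response
obs_idx = np.arange(5, 100, 8)            # 12 sampled grid locations
limit = 1.5                               # detection limit
censored = truth[obs_idx] < limit         # left-censored: only "y <= limit" is known
unc_idx, cen_idx = obs_idx[~censored], obs_idx[censored]
y_unc = truth[unc_idx]

# Stand-in ensemble: a trend fitted to uncensored data only, plus smooth random
# perturbations. In practice each row would be one tree's prediction on the grid.
n_members, n_feat = 60, 30
base = np.interp(x, x[unc_idx], y_unc)
freqs = np.arange(1, n_feat + 1)
features = np.cos(np.pi * np.outer(freqs, x))              # (n_feat, n_grid)
weights = rng.normal(size=(n_members, n_feat)) * 0.5 / freqs
ensemble = base + weights @ features                       # (n_members, n_grid)

# PCA of the ensemble: any member is written as mean + a @ basis.
mean = ensemble.mean(axis=0)
_, s, vt = np.linalg.svd(ensemble - mean, full_matrices=False)
k = int(np.sum(s > 1e-10 * s[0]))                          # numerical rank
basis = (s[:k, None] * vt[:k]) / np.sqrt(n_members - 1)    # (k, n_grid)

# Observations become linear constraints on the coefficient vector a:
#   uncensored:     A_eq @ a == b_eq   (predictor equals the measured value)
#   left-censored:  A_in @ a <= b_in   (predictor stays below the detection limit)
A_eq, b_eq = basis[:, unc_idx].T, y_unc - mean[unc_idx]
A_in, b_in = basis[:, cen_idx].T, limit - mean[cen_idx]


def project(a0, A_eq, b_eq, A_in, b_in, tol=1e-9):
    """Move a0 to a nearby point satisfying the constraints (active-set heuristic)."""
    active = np.zeros(len(b_in), dtype=bool)
    for _ in range(len(b_in) + 1):
        A = np.vstack([A_eq, A_in[active]])
        b = np.concatenate([b_eq, b_in[active]])
        # Minimum-norm correction that puts a0 on the affine set {a : A a = b}.
        a = a0 - np.linalg.lstsq(A, A @ a0 - b, rcond=None)[0]
        violated = (A_in @ a > b_in + tol) & ~active
        if not violated.any():
            break
        active |= violated   # treat newly violated inequalities as equalities
    return a


# Randomized reconstruction: one random coefficient draw per member.
recon = np.array([
    mean + project(rng.normal(size=k), A_eq, b_eq, A_in, b_in) @ basis
    for _ in range(50)
])
prediction = recon.mean(axis=0)           # final spatial prediction on the grid

print("max equality residual:", np.abs(recon[:, unc_idx] - y_unc).max())
print("max value at censored sites:", recon[:, cen_idx].max(), "limit:", limit)
```

In this sketch the trend ignores the censored sites and therefore sits above the detection limit there, so the inequality constraints are what pull the reconstructed members down. A production version would presumably replace `project` with a dedicated quadratic programming solver and handle right-censored and interval-censored observations with additional inequality rows.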


Source journal

Artificial Intelligence in Geosciences

CiteScore

4.20

Self-citation rate

0.00%

Articles published