Random forest for spatial prediction of censored response variables
Francky Fouedjio. Artificial Intelligence in Geosciences, volume 2, pages 115-127. Published 2021-12-01. DOI: 10.1016/j.aiig.2022.02.001
https://www.sciencedirect.com/science/article/pii/S2666544122000016
Spatial prediction of a continuous response variable, when spatially exhaustive predictor variables are available within the region under study, has become ubiquitous in many geoscience fields. The response variable is often subject to detection limits imposed by the measuring instrument or the sampling protocol. Consequently, observations of the response variable are censored (left-censored, right-censored, or interval-censored). Machine learning methods designed for the spatial prediction of uncensored response variables cannot explicitly account for censored observations. In such cases, they are routinely applied through ad hoc approaches, such as ignoring the censored observations or replacing them with arbitrary values. The resulting spatial prediction may therefore be inaccurate and sensitive to the assumptions and approximations involved in those arbitrary choices. This paper introduces a random forest-based machine learning method for spatially predicting a censored response variable, in which the censored observations are explicitly taken into account. The basic idea is to build an ensemble of regression tree predictors by training the classical regression random forest on the subset of the data containing only the uncensored observations. Principal component analysis applied to this ensemble then translates the response variable's observations (uncensored and censored) into a system of linear equalities and inequalities. This system is solved through randomized quadratic programming, which yields an ensemble of reconstructed regression tree predictors that exactly honor the response variable's observations (uncensored and censored). The response variable's spatial prediction is then obtained by averaging this latter ensemble. 
The effectiveness of the proposed machine learning method is illustrated on simulated data for which ground truth is available and showcased on real-world data, including geochemical data. The results suggest that the proposed technique makes better use of the response variable's censored observations than ad hoc methods do.
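
To make the idea of "honoring" uncensored and censored observations concrete, the following minimal Python sketch encodes each observation as a pair of bounds: an equality for an uncensored value, and a one-sided or two-sided inequality for a censored one. It is an illustration only, not the paper's implementation. The names `Observation`, `bounds`, and `honors`, the example values, and the use of non-strict inequalities are all assumptions made here. The sketch covers only the constraint-checking step and does not reproduce the random forest, the principal component analysis, or the randomized quadratic programming.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One response measurement (hypothetical container, not from the paper)."""
    kind: str                  # "exact", "left", "right", or "interval"
    value: float = math.nan    # measured value, used only when kind == "exact"
    lower: float = -math.inf   # known lower bound (right- or interval-censored)
    upper: float = math.inf    # known upper bound (left- or interval-censored)


def bounds(obs):
    """Return the (lower, upper) range a prediction must fall in."""
    if obs.kind == "exact":
        return (obs.value, obs.value)    # equality constraint
    if obs.kind == "left":
        return (-math.inf, obs.upper)    # e.g. below a detection limit
    if obs.kind == "right":
        return (obs.lower, math.inf)     # e.g. above a saturation limit
    if obs.kind == "interval":
        return (obs.lower, obs.upper)
    raise ValueError(f"unknown observation kind: {obs.kind!r}")


def honors(predictions, observations, tol=1e-9):
    """True if every prediction satisfies the constraint of its observation."""
    if len(predictions) != len(observations):
        raise ValueError("predictions and observations differ in length")
    for pred, obs in zip(predictions, observations):
        lo, hi = bounds(obs)
        if not (lo - tol <= pred <= hi + tol):
            return False
    return True


# Illustrative values only (assumed, not taken from the paper's data sets).
observations = [
    Observation("exact", value=3.2),
    Observation("left", upper=0.5),       # reported as "< 0.5"
    Observation("right", lower=10.0),     # reported as "> 10"
    Observation("interval", lower=1.0, upper=2.0),
]

print(honors([3.2, 0.25, 12.0, 1.5], observations))  # True
print(honors([3.2, 0.70, 12.0, 1.5], observations))  # False: 0.70 exceeds 0.5
```

In this representation, an ad hoc substitution such as replacing a left-censored value with half the detection limit fixes the prediction target at a single point inside the admissible range, whereas the inequality form keeps the whole range available to whatever solver enforces the constraints.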