Classification random forest with exact conditioning for spatial prediction of categorical variables

Artificial Intelligence in Geosciences (IF 4.2) | Pub Date: 2021-12-01 | Epub Date: 2021-12-11 | DOI: 10.1016/j.aiig.2021.11.003

Francky Fouedjio

Artificial Intelligence in Geosciences, Volume 2, Pages 82-95. Journal Article. https://www.sciencedirect.com/science/article/pii/S2666544121000290

Citations: 4

Abstract

Machine learning methods are increasingly used to spatially predict a categorical target variable when spatially exhaustive predictor variables are available within the study region. Although these methods exhibit competitive spatial prediction performance, by construction they do not exactly honor the categorical target variable's observed values at sampling locations. Competing geostatistical methods, on the other hand, match the observed values at sampling locations exactly by design. In many geoscience applications, it is desirable to match the observed values of the categorical target variable at sampling locations exactly, especially when its measurements can reasonably be considered error-free. This paper addresses the problem of exact conditioning of machine learning methods for the spatial prediction of categorical variables. It introduces a classification random forest-based approach in which the categorical target variable is exactly conditioned to the data, thus sharing the exact conditioning property of competing geostatistical methods. The proposed method extends previous work dedicated to continuous target variables by using an implicit representation of the categorical target variable. The basic idea is to transform the ensemble of (categorical) classification tree predictions produced by the traditional classification random forest into an ensemble of (continuous) signed distances associated with each category of the categorical target variable. An orthogonal representation of the ensemble of signed distances is then created through principal component analysis, which allows the exact conditioning problem to be reformulated as a system of linear inequalities on the principal component scores. New principal component scores that ensure exact conditioning to the data are then sampled via randomized quadratic programming. The resulting conditional signed distances are converted back into an ensemble of categorical outputs, which perfectly honor the categorical target variable's observed values at sampling locations. Finally, a majority vote is used to aggregate the ensemble of categorical outputs. The effectiveness of the proposed method is illustrated on a simulated dataset for which ground truth is available and showcased on a real-world dataset that includes geochemical data. A comparison with geostatistical and traditional machine learning methods shows that the proposed technique can perfectly match the categorical target variable's observed values at sampling locations while maintaining competitive out-of-sample predictive performance.
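
The pipeline described in the abstract (categorical ensemble, signed distances, principal components, linear inequalities on the scores, randomized quadratic programming, majority vote) can be illustrated with a toy sketch. This is not the paper's implementation, and nearly everything in it is an assumption made for illustration: a 1D grid with two categories (so a single signed-distance field suffices, whereas the paper uses one per category), simulated "tree" outputs instead of a fitted random forest, the sign convention (negative inside category 1), the margin `eps`, Gaussian draws for the random scores, and a brute-force active-set enumeration standing in for a proper quadratic programming solver. The abstract does not state the exact objective of the randomized quadratic program; here each random draw is simply moved to the nearest score vector that satisfies the inequalities.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)

# 1. Stand-in for the forest output (hypothetical data, not a fitted forest):
#    each "tree" predicts category 1 left of a random boundary, 0 to the right.
n_cells, n_trees = 60, 40
grid = np.arange(n_cells)
boundaries = rng.integers(15, 45, size=n_trees)
tree_labels = (grid[None, :] < boundaries[:, None]).astype(int)

def signed_distance(labels):
    """Signed distance to category 1: negative inside it, positive outside."""
    idx = np.arange(labels.size)
    to_cat1 = np.abs(idx[:, None] - idx[labels == 1][None, :]).min(axis=1)
    to_cat0 = np.abs(idx[:, None] - idx[labels == 0][None, :]).min(axis=1)
    return np.where(labels == 1, -to_cat0, to_cat1).astype(float)

# 2. Categorical ensemble -> continuous signed-distance ensemble.
D = np.array([signed_distance(lab) for lab in tree_labels])

# 3. PCA via SVD, so that D = mean + scores @ components.
mean = D.mean(axis=0)
U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
keep = s > 1e-8 * s[0]
components = Vt[keep]
score_std = s[keep] / np.sqrt(n_trees - 1)

# 4. Observations (assumed error-free) as linear inequalities A z <= b:
#    the field must be <= -eps where category 1 is observed, >= eps otherwise.
obs_loc = np.array([10, 25, 35, 50])
obs_cat = np.array([1, 1, 0, 0])
eps = 0.5
sign = np.where(obs_cat == 1, 1.0, -1.0)
A = sign[:, None] * components[:, obs_loc].T
b = -eps - sign * mean[obs_loc]

def nearest_feasible(z0, A, b, tol=1e-8):
    """Closest point to z0 with A z <= b, found by enumerating active sets.

    Only practical for a handful of constraints; a real QP solver is needed
    at realistic sample sizes.
    """
    best, best_dist = None, np.inf
    for r in range(len(b) + 1):
        for active in combinations(range(len(b)), r):
            rows = list(active)
            z = z0.copy()
            if rows:  # project z0 onto the constraints treated as equalities
                z = z0 - np.linalg.pinv(A[rows]) @ (A[rows] @ z0 - b[rows])
            dist = np.sum((z - z0) ** 2)
            if np.all(A @ z <= b + tol) and dist < best_dist:
                best, best_dist = z, dist
    return best

# 5. Randomized QP: draw random scores, move each to the nearest feasible point,
#    then turn the conditional signed distances back into categories.
n_real = 25
cond_labels = np.empty((n_real, n_cells), dtype=int)
for i in range(n_real):
    z0 = rng.normal(size=score_std.size) * score_std
    z = nearest_feasible(z0, A, b)
    cond_labels[i] = (mean + z @ components) < 0

# 6. Majority vote over the conditioned ensemble.
vote = (cond_labels.mean(axis=0) > 0.5).astype(int)

honors = lambda labels: np.all(labels[:, obs_loc] == obs_cat, axis=1).mean()
print("share of raw trees honoring all data:", honors(tree_labels))
print("share of conditioned members honoring all data:", honors(cond_labels))
```

Because every conditioned member satisfies the inequalities with a margin of `eps`, each one reproduces the observed category at every sampling location, and so does the majority vote. The raw simulated trees carry no such guarantee.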

Source journal

Artificial Intelligence in Geosciences

CiteScore

4.20

Self-citation rate

0.00%

Articles published