Predictor Selection and Machine Learning Regression Methods to Predict Saturated Hydraulic Conductivity From a Large Public Soil Database

IF 1 · CAS Zone 4 (Agricultural and Forestry Sciences) · Q3 · AGRICULTURAL ENGINEERING · Journal of the ASABE · Pub Date: 2023-01-01 · DOI: 10.13031/ja.15068

Toby A. Adjuik, S. Nokes, M. Montross, M. Sama, O. Wendroth



Abstract

Highlights

In this study, six machine learning (ML) models were developed using a large database of soils to predict saturated hydraulic conductivity of these soils using easily measured soil characteristics.
Tree-based regression models outperformed all other ML models tested.
Neural networks were not suitable for predicting saturated hydraulic conductivity.
Clay content, followed by bulk density, explained the highest amount of variation in the data of the predictors examined.

Abstract. One of the most important soil hydraulic properties for modeling water transport in the vadose zone is saturated hydraulic conductivity. However, it is challenging to measure it in the field. Pedotransfer Functions (PTFs) are mathematical models that can predict saturated hydraulic conductivity (Ks) from easily measured soil characteristics. Though the development of PTFs for predicting Ks is not new, the tools and methods used to predict Ks are continuously evolving. Model performance depends on choosing soil features that explain the largest amount of Ks variance with the fewest input variables. In addition, the lack of interpretability in most “black box” machine learning models makes it difficult to extract practical knowledge as the machine learning process obfuscates the relationship between inputs and outputs in the PTF models. The objective of this study was to develop a set of new PTFs for predicting Ks using machine learning algorithms and a large database of over 8000 soil samples (the Florida Soil Characterization Database) while incorporating statistical methods to inform predictor selection for the model inputs. Of the machine learning (ML) models tested, random forest regression (RF) and gradient-boosted regression (GB) gave the best performances, with R² = 0.71 and RMSE = 0.47 cm h⁻¹ on the test data for both. Using the permutation feature importance technique, the GB and RF regression models showed similar results, where clay content described the most variation in the data, followed by bulk density. The implication of this study is that, when predicting Ks using the Florida Soil Characterization Database, priority should be given to obtaining quality data on clay content and bulk density as they are the most influential predictors for estimating Ks.

Keywords: Deep learning, Gradient boosted regression, Pedotransfer functions, Random forest regression, Soil database, Soil properties.
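The workflow summarized in the abstract (fit RF and GB regressors, score them with R² and RMSE on held-out data, then rank predictors by permutation feature importance) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code or data: the samples are synthetic, the feature names (`clay_pct`, `bulk_density`, `organic_carbon_pct`), the log-scale target, and the linear relation that generates it are all assumptions chosen only so that clay dominates and bulk density comes second, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical predictor names; the paper's actual input set is not listed here.
FEATURES = ["clay_pct", "bulk_density", "organic_carbon_pct"]


def make_synthetic_soils(n=600, seed=0):
    """Synthetic stand-in data; NOT the Florida Soil Characterization Database."""
    rng = np.random.default_rng(seed)
    clay = rng.uniform(0.0, 60.0, n)   # percent
    bd = rng.uniform(1.1, 1.8, n)      # g cm^-3
    oc = rng.uniform(0.0, 5.0, n)      # percent
    # Assumed relation for illustration only: clay has the largest effect,
    # bulk density the second largest, organic carbon almost none.
    log_ks = (1.5 - 0.03 * clay - 1.0 * (bd - 1.4) + 0.02 * oc
              + rng.normal(0.0, 0.2, n))
    return np.column_stack([clay, bd, oc]), log_ks


def evaluate(model, X_train, X_test, y_train, y_test, seed=0):
    """Fit a model, score it on the test split, and rank the predictors."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    r2 = r2_score(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    # Permutation importance: mean drop in test-set R^2 when one column is shuffled.
    perm = permutation_importance(model, X_test, y_test,
                                  n_repeats=10, random_state=seed)
    order = np.argsort(perm.importances_mean)[::-1]
    return {"r2": r2, "rmse": rmse, "ranking": [FEATURES[i] for i in order]}


X, y = make_synthetic_soils()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {
    "RF": evaluate(RandomForestRegressor(n_estimators=200, random_state=0),
                   X_train, X_test, y_train, y_test),
    "GB": evaluate(GradientBoostingRegressor(random_state=0),
                   X_train, X_test, y_train, y_test),
}

for name, res in results.items():
    print(name, f"R2={res['r2']:.2f}", f"RMSE={res['rmse']:.2f}", res["ranking"])
```

Because the importance is computed on the held-out split, it measures how much each predictor contributes to generalization rather than to fitting the training data. The scores printed here describe only the synthetic data and are not comparable to the R² and RMSE values reported in the abstract.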



