Enhancing XGBoost’s accuracy in soil organic matter prediction through feature fusion
Authors: Shaofang He, Li Zhou, Hongxia Xie, Siqiao Tan
Journal: Paddy and Water Environment (Impact Factor 1.9; JCR Q2, Agricultural Engineering)
Published: 2024-06-04
DOI: https://doi.org/10.1007/s10333-024-00980-y
Citations: 0
Abstract
Soil organic matter (SOM) content is a key indicator of soil fertility and quality, so accurate and efficient prediction methods are important. Visible near-infrared reflectance (vis–NIR) spectroscopy has been pivotal in predicting SOM content. However, soil profile data recorded during sample collection also carry information about organic matter, which suggests that using spectra and profile data separately may not be optimal. This study investigated whether fusing vis–NIR spectra with soil profile properties could improve the performance of the extreme gradient boosting (XGBoost) algorithm in predicting SOM content. The samples were collected from paddy soils in Changsha and Zhuzhou, China. Three modeling approaches were compared: XGBoost built on LASSO-selected vis–NIR features (LF-XGBoost), on profile features (PF-XGBoost), and on fusion features (FF-XGBoost). The models were evaluated on randomly split sample sets and with fivefold cross-validation (fivefold CV), using the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). FF-XGBoost predicted SOM content better than LF-XGBoost and PF-XGBoost, indicating that the fusion features improved prediction. On the randomly split dataset, FF-XGBoost achieved an R² of 0.897, an RMSE of 3.746, and an MAE of 2.935, with R² improvements of 31% and 24% over LF-XGBoost and PF-XGBoost, respectively. In fivefold CV, FF-XGBoost achieved an R²_CV of 0.806, an RMSE_CV of 5.136, and an MAE_CV of 1.913, with R²_CV improvements of 11% and 51%, respectively. According to the Shapley additive explanations (SHAP) model, variations in ‘Color_class’, ‘Profile_level’, and wavelength ‘767’ within the fusion features had the largest impact on FF-XGBoost’s output. FF-XGBoost also achieved higher prediction accuracy than other commonly used regression algorithms.
This study covered only paddy soils in Changsha and Zhuzhou and used well-established modeling methods. The results can serve as a starting point for further research into new feature fusion techniques, advanced modeling methods, and the transferability of the findings to other soil landscapes.
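The evaluation protocol described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not say how the fusion features are constructed, so `fuse_features` assumes plain column-wise concatenation of already-numeric spectral and profile features, and all function names here are hypothetical. The model itself (XGBoost) is left out; the sketch covers only the three reported metrics and a fivefold split.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fuse_features(spectral_rows, profile_rows):
    """Assumed fusion: concatenate each sample's spectral and profile features.

    Categorical profile attributes (e.g. a colour class) are assumed to be
    numerically encoded beforehand; the abstract does not describe this step.
    """
    if len(spectral_rows) != len(profile_rows):
        raise ValueError("both feature sets must describe the same samples")
    return [list(s) + list(p) for s, p in zip(spectral_rows, profile_rows)]


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx
```

In use, a regressor would be fitted on the fused rows selected by each `train_idx`, and the three metrics would then be computed on the corresponding `test_idx` predictions and averaged across the five folds.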
About the journal:
The aim of Paddy and Water Environment is to advance the science and technology of water and environment related disciplines in paddy-farming. The scope includes the paddy-farming related scientific and technological aspects in agricultural engineering such as irrigation and drainage, soil and water conservation, land and water resources management, irrigation facilities and disaster management, paddy multi-functionality, agricultural policy, regional planning, bioenvironmental systems, and ecological conservation and restoration in paddy farming regions.