Sample size effects on landslide susceptibility models: A comparative study of heuristic, statistical, machine learning, deep learning and ensemble learning models with SHAP analysis

Computers & Geosciences (IF 4.4; Zone 2, Earth Sciences; JCR Q1, Computer Science, Interdisciplinary Applications) Pub Date: 2024-09-13 DOI: 10.1016/j.cageo.2024.105723

Shilong Yang, Jiayao Tan, Danyuan Luo, Yuzhou Wang, Xu Guo, Qiuyu Zhu, Chuanming Ma, Hanxiang Xiong

Computers & Geosciences, Volume 193, Article 105723. Journal Article. https://www.sciencedirect.com/science/article/pii/S0098300424002061

Citations: 0

Abstract

In landslide susceptibility assessment (LSA), inventory incompleteness affects the accuracy of different models to varying degrees, yet this effect remains under-researched. This study investigated six LSA models spanning heuristic, statistical, machine learning, deep learning, and ensemble learning approaches (analytical hierarchy process (AHP), frequency ratio (FR), logistic regression (LR), Keras-based deep learning (KBDL), XGBoost, and LightGBM) across six sample sizes (100%, 90%, 75%, 50%, 25%, and 10%). Results revealed that XGBoost and LightGBM consistently outperformed the other models at every sample size. The LR and KBDL models followed, while the FR model was the most affected by sample size variations. AHP, an empirical model, remained unaffected by sample size. Through SHapley Additive exPlanations (SHAP) analysis, elevation, NDVI, slope, land use, and distance to roads and rivers emerged as pivotal indicators of landslide occurrence in the study area, suggesting that human activities significantly influence these events. Five time-varying indicators of human activity and climate validated this inference, which provides a new method for identifying landslide triggering factors, especially in areas of intense human activity. Based on the findings, a comprehensive framework for LSA is proposed to assist landslide managers in making informed decisions. Future research should focus on expanding model diversity to address the effects of sample size, enhancing the adaptability of the LSA framework, deepening the analysis of human activity impacts on landslides using explainable machine learning techniques, addressing temporal inventory incompleteness in LSA, and critically evaluating model sensitivity to sample size variations across multiple disciplines.
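The sample-size experiment described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it uses the textbook frequency ratio definition (share of landslides in a class divided by share of area in that class), a hypothetical three-class slope factor, and synthetic counts chosen purely for illustration. Only the six sampling fractions come from the abstract; the names `frequency_ratio`, `subsample`, and `fr_by_fraction` are assumptions introduced here.

```python
import random
from collections import Counter

# Sampling fractions listed in the abstract (100% down to 10% of the inventory).
FRACTIONS = [1.0, 0.9, 0.75, 0.5, 0.25, 0.1]


def frequency_ratio(landslide_classes, area_classes):
    """FR per class = (share of landslides in class) / (share of area in class)."""
    ls_counts = Counter(landslide_classes)
    area_counts = Counter(area_classes)
    n_ls, n_area = len(landslide_classes), len(area_classes)
    return {
        c: (ls_counts.get(c, 0) / n_ls) / (area_counts[c] / n_area)
        for c in area_counts
    }


def subsample(inventory, fraction, rng):
    """Draw a random subset of the inventory without replacement."""
    k = max(1, round(len(inventory) * fraction))
    return rng.sample(inventory, k)


# Synthetic, hypothetical data: the slope class of every grid cell in the
# study area and of every recorded landslide. These are not from the paper.
rng = random.Random(0)
area = ["gentle"] * 600 + ["moderate"] * 300 + ["steep"] * 100
landslides = ["gentle"] * 20 + ["moderate"] * 50 + ["steep"] * 30

# Recompute FR on progressively smaller inventories.
fr_by_fraction = {
    f: frequency_ratio(subsample(landslides, f, rng), area) for f in FRACTIONS
}

for f, fr in fr_by_fraction.items():
    print(f"{f:>4.0%}", {c: round(v, 2) for c, v in sorted(fr.items())})
```

With the full synthetic inventory the ratios are about 0.33, 1.67, and 3.0 for the gentle, moderate, and steep classes. At smaller fractions each ratio rests on fewer landslide points and so drifts with the random draw, which is one plausible reading of why a count-based model such as FR would be sensitive to sample size, whereas AHP weights, which come from expert judgment, would not change.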




Source journal

Computers & Geosciences (Geosciences, Multidisciplinary)

CiteScore: 9.30

Self-citation rate: 6.80%

Articles published: 164

Review time: 3.4 months

About the journal: Computers & Geosciences publishes high-impact, original research at the interface between Computer Sciences and Geosciences. Publications should apply modern computer science paradigms, whether computational or informatics-based, to address problems in the geosciences.