Combining categorical boosting and Shapley additive explanations for building an interpretable ensemble classifier for identifying mineralization-related geochemical anomalies
Yongliang Chen, Bowen Chen, Alina Shayilan
Ore Geology Reviews, Volume 173, Article 106263. Published 2024-10-01. DOI: 10.1016/j.oregeorev.2024.106263
https://www.sciencedirect.com/science/article/pii/S0169136824003962
The vast majority of shallow and deep learning techniques used to identify mineralization-related geochemical anomalies are black-box algorithms that cannot elucidate the contribution of each element to the model predictions. In addition, most anomaly identification models built with shallow or deep learning algorithms lack robustness. Establishing interpretable and robust models therefore remains a challenge in applying machine learning to geochemical anomaly identification. To this end, the categorical boosting (CatBoost) algorithm was employed to build a robust ensemble classifier to identify mineralization-related anomalies from the 1:50,000 geochemical reconnaissance data (stream sediment survey) in the Yeniugou area of Xinjiang (China). The receiver operating characteristic (ROC) curve and the precision-recall (P-R) curve of the ensemble model were plotted, and the area under the ROC curve (AUC) and the area under the P-R curve (AUPRC) were calculated to measure its performance. The ROC curve of the ensemble model approximates that of the perfect classification model, and its P-R curve is close to the upper right corner of the P-R space. The AUC and AUPRC values of the ensemble model reach 0.9981 and 0.7816, respectively. The identified polymetallic mineralization-related geochemical anomalies account for 3% of the whole exploration area while correctly identifying all known polymetallic deposits. To enhance the interpretability of the CatBoost model, the Shapley additive explanations (SHAP) tool was adopted to graphically interpret the predictions of the ensemble model. The graphic interpretation shows that the importance order of the 14 elements is Ni-Au-Ag-Sn-As-Cr-Zn-Cu-Pb-Sb-W-Bi-Mo-Co. Cu and Ni are the most likely metallogenic elements of the study area. Cu interacts with Ni, Ag, As, Sn, Cr, Zn, Pb, Sb, W, Bi, and Co, and Ni interacts with Au, Sn, As, Zn, Cu, W, Bi, and Co. Two polymetallic prospective areas were delineated in the study area: one is a Cu-Ni-polymetallic mineralization prospective area, and the other is a Ni-polymetallic mineralization prospective area. It can be concluded that the combination of CatBoost and SHAP is an effective way to construct an interpretable ensemble model with high performance and robustness for identifying mineralization-related geochemical anomalies.
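The evaluation metrics and the Shapley attribution named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' code and uses neither the CatBoost nor the SHAP library: the function names (`roc_auc`, `average_precision`, `shapley_values`, `toy_model`), the 20-sample labels and scores, and the three-feature scoring formula are all hypothetical, chosen only to make the arithmetic easy to follow. The Shapley values are computed by brute-force subset enumeration against an all-zero baseline, which is practical only for a handful of features; tree-specific explainers are normally used for real ensembles.

```python
from itertools import combinations
from math import factorial


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the P-R curve: mean precision at each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / sum(labels)


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's value; the rest stay at baseline.
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi


def toy_model(v):
    """Hypothetical score over (Ni, Cu, Co): two main effects plus a Ni*Cu term."""
    return 2 * v[0] + v[1] + v[0] * v[1]


# Hypothetical imbalanced data: 2 anomalous samples out of 20,
# ranked 1st and 4th by a strictly decreasing score.
labels = [1, 0, 0, 1] + [0] * 16
scores = [1.0 - 0.05 * i for i in range(20)]

print("AUC  :", roc_auc(labels, scores))            # 34/36, about 0.944
print("AUPRC:", average_precision(labels, scores))  # 0.75

x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print("Shapley values (Ni, Cu, Co):", phi)           # approximately [3, 3, 0]
print("Sum vs. score gap:", sum(phi), toy_model(x) - toy_model(baseline))
```

Two points from the toy run relate to the abstract. With rare positives, two negatives ranked above one positive barely move the AUC but pull the average precision down to 0.75, which is one reason an AUC near 1 can coexist with a noticeably lower AUPRC. The Shapley values sum to the gap between the sample's score and the baseline score, and the Ni*Cu interaction term is split equally between Ni and Cu, which is the additivity property that makes per-element contributions readable.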
About the journal:
Ore Geology Reviews aims to familiarize all earth scientists with recent advances in a number of interconnected disciplines related to the study of, and search for, ore deposits. The reviews range from brief to longer contributions, but the journal preferentially publishes manuscripts that fill the niche between the commonly shorter journal articles and the comprehensive book coverages, and thus has a special appeal to many authors and readers.