{"title":"通过可解释的机器学习算法对急性缺血性中风进行病因学预测分类:一项多中心前瞻性队列研究","authors":"Siding Chen, Xiaomeng Yang, Hongqiu Gu, Yanzhao Wang, Zhe Xu, Yong Jiang, Yongjun Wang","doi":"10.1186/s12874-024-02331-1","DOIUrl":null,"url":null,"abstract":"The prognosis, recurrence rates, and secondary prevention strategies varied significantly among different subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, there is currently a lack of research utilizing ML algorithms for predicting AIS etiology. We aimed to use interpretable ML algorithms to develop AIS etiology prediction models, identify critical factors in etiology classification, and enhance existing clinical categorization. This study involved patients with the Third China National Stroke Registry (CNSR-III). Nine models, which included Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LGBM), Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and logistic regression (LR), were employed to predict large artery atherosclerosis (LAA), small vessel occlusion (SVO), and cardioembolism (CE) using an 80:20 randomly split training and test set. We designed an SFS-XGB with 10-fold cross-validation for feature selection. The primary evaluation metrics for the models included the area under the receiver operating characteristic curve (AUC) for discrimination and the Brier score (or calibration plots) for calibration. A total of 5,213 patients were included, comprising 2,471 (47.4%) with LAA, 2,153 (41.3%) with SVO, and 589 (11.3%) with CE. In both LAA and SVO models, the AUC values of the ML models were significantly higher than that of the LR model (P < 0.001). The optimal model for predicting SVO (AUC [RF model] = 0.932) outperformed the optimal LAA model (AUC [NGB model] = 0.917) and the optimal CE model (AUC [LGBM model] = 0.846). Each model displayed relatively satisfactory calibration. Further analysis showed that the optimal CE model could identify potential CE patients in the undetermined etiology (SUE) group, accounting for 1,900 out of 4,156 (45.7%). The ML algorithm effectively classified patients with LAA, SVO, and CE, demonstrating superior classification performance compared to the LR model. The optimal ML model can identify potential CE patients among SUE patients. These newly identified predictive factors may complement the existing etiological classification system, enabling clinicians to promptly categorize stroke patients’ etiology and initiate optimal strategies for secondary prevention.","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"4 1","pages":""},"PeriodicalIF":3.9000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Predictive etiological classification of acute ischemic stroke through interpretable machine learning algorithms: a multicenter, prospective cohort study\",\"authors\":\"Siding Chen, Xiaomeng Yang, Hongqiu Gu, Yanzhao Wang, Zhe Xu, Yong Jiang, Yongjun Wang\",\"doi\":\"10.1186/s12874-024-02331-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The prognosis, recurrence rates, and secondary prevention strategies varied significantly among different subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, there is currently a lack of research utilizing ML algorithms for predicting AIS etiology. We aimed to use interpretable ML algorithms to develop AIS etiology prediction models, identify critical factors in etiology classification, and enhance existing clinical categorization. This study involved patients with the Third China National Stroke Registry (CNSR-III). Nine models, which included Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LGBM), Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and logistic regression (LR), were employed to predict large artery atherosclerosis (LAA), small vessel occlusion (SVO), and cardioembolism (CE) using an 80:20 randomly split training and test set. We designed an SFS-XGB with 10-fold cross-validation for feature selection. The primary evaluation metrics for the models included the area under the receiver operating characteristic curve (AUC) for discrimination and the Brier score (or calibration plots) for calibration. A total of 5,213 patients were included, comprising 2,471 (47.4%) with LAA, 2,153 (41.3%) with SVO, and 589 (11.3%) with CE. In both LAA and SVO models, the AUC values of the ML models were significantly higher than that of the LR model (P < 0.001). The optimal model for predicting SVO (AUC [RF model] = 0.932) outperformed the optimal LAA model (AUC [NGB model] = 0.917) and the optimal CE model (AUC [LGBM model] = 0.846). Each model displayed relatively satisfactory calibration. Further analysis showed that the optimal CE model could identify potential CE patients in the undetermined etiology (SUE) group, accounting for 1,900 out of 4,156 (45.7%). The ML algorithm effectively classified patients with LAA, SVO, and CE, demonstrating superior classification performance compared to the LR model. The optimal ML model can identify potential CE patients among SUE patients. These newly identified predictive factors may complement the existing etiological classification system, enabling clinicians to promptly categorize stroke patients’ etiology and initiate optimal strategies for secondary prevention.\",\"PeriodicalId\":9114,\"journal\":{\"name\":\"BMC Medical Research Methodology\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Research Methodology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12874-024-02331-1\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-024-02331-1","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0
摘要
不同亚型的急性缺血性脑卒中(AIS)在预后、复发率和二级预防策略方面差异很大。机器学习(ML)技术可以发现医疗数据中错综复杂的非线性关系,从而识别与病因分类相关的因素。然而,目前还缺乏利用 ML 算法预测 AIS 病因的研究。我们旨在利用可解释的 ML 算法开发 AIS 病因学预测模型,识别病因学分类的关键因素,并加强现有的临床分类。本研究涉及第三届中国卒中登记(CNSR-III)的患者。九种模型包括自然梯度提升(NGBoost)、分类提升(CatBoost)、极端梯度提升(XGBoost)、随机森林(RF)、轻梯度提升机(LGBM)、梯度提升决策树(GBDT)、采用自适应提升(AdaBoost)、支持向量机(SVM)和逻辑回归(LR)来预测大动脉粥样硬化(LAA)、小血管闭塞(SVO)和心肌栓塞(CE):20 随机分配训练集和测试集。我们设计的 SFS-XGB 采用 10 倍交叉验证进行特征选择。模型的主要评估指标包括用于判别的接收者操作特征曲线下面积(AUC)和用于校准的布赖尔评分(或校准图)。共纳入了 5213 例患者,其中 LAA 患者 2471 例(47.4%),SVO 患者 2153 例(41.3%),CE 患者 589 例(11.3%)。在 LAA 和 SVO 模型中,ML 模型的 AUC 值均显著高于 LR 模型(P < 0.001)。预测 SVO 的最佳模型(AUC [RF 模型] = 0.932)优于最佳 LAA 模型(AUC [NGB 模型] = 0.917)和最佳 CE 模型(AUC [LGBM 模型] = 0.846)。每个模型都显示出相对令人满意的校准效果。进一步分析表明,最佳 CE 模型可以识别病因未定(SUE)组中潜在的 CE 患者,在 4156 例患者中占 1900 例(45.7%)。与 LR 模型相比,ML 算法能有效地对 LAA、SVO 和 CE 患者进行分类,显示出更优越的分类性能。最佳的 ML 模型可以在 SUE 患者中识别出潜在的 CE 患者。这些新发现的预测因素可以补充现有的病因分类系统,使临床医生能够及时对卒中患者的病因进行分类,并启动二级预防的最佳策略。
Predictive etiological classification of acute ischemic stroke through interpretable machine learning algorithms: a multicenter, prospective cohort study
The prognosis, recurrence rates, and secondary prevention strategies varied significantly among different subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, there is currently a lack of research utilizing ML algorithms for predicting AIS etiology. We aimed to use interpretable ML algorithms to develop AIS etiology prediction models, identify critical factors in etiology classification, and enhance existing clinical categorization. This study involved patients with the Third China National Stroke Registry (CNSR-III). Nine models, which included Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LGBM), Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and logistic regression (LR), were employed to predict large artery atherosclerosis (LAA), small vessel occlusion (SVO), and cardioembolism (CE) using an 80:20 randomly split training and test set. We designed an SFS-XGB with 10-fold cross-validation for feature selection. The primary evaluation metrics for the models included the area under the receiver operating characteristic curve (AUC) for discrimination and the Brier score (or calibration plots) for calibration. A total of 5,213 patients were included, comprising 2,471 (47.4%) with LAA, 2,153 (41.3%) with SVO, and 589 (11.3%) with CE. In both LAA and SVO models, the AUC values of the ML models were significantly higher than that of the LR model (P < 0.001). The optimal model for predicting SVO (AUC [RF model] = 0.932) outperformed the optimal LAA model (AUC [NGB model] = 0.917) and the optimal CE model (AUC [LGBM model] = 0.846). Each model displayed relatively satisfactory calibration. Further analysis showed that the optimal CE model could identify potential CE patients in the undetermined etiology (SUE) group, accounting for 1,900 out of 4,156 (45.7%). The ML algorithm effectively classified patients with LAA, SVO, and CE, demonstrating superior classification performance compared to the LR model. The optimal ML model can identify potential CE patients among SUE patients. These newly identified predictive factors may complement the existing etiological classification system, enabling clinicians to promptly categorize stroke patients’ etiology and initiate optimal strategies for secondary prevention.
期刊介绍:
BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.