Sofonyas Abebaw Tiruneh, Daniel Lorber Rolnik, Helena J. Teede, Joanne Enticott
International Journal of Medical Informatics. Journal article, published 2024-10-05. DOI: 10.1016/j.ijmedinf.2024.105645. https://www.sciencedirect.com/science/article/pii/S1386505624003083
Prediction of pre-eclampsia with machine learning approaches: Leveraging important information from routinely collected data
Background
Globally, pre-eclampsia (PE) is a leading cause of maternal and perinatal morbidity and mortality. PE prediction using routinely collected data has the advantage of being widely applicable, particularly in low-resource settings. Early intervention for high-risk women might reduce PE incidence and related complications. We aimed to replicate our published machine learning (ML) work predicting another maternal condition (gestational diabetes) to (1) predict PE using routine health data, (2) identify the optimal ML model, and (3) compare it with a logistic regression approach.
Methods
Data were from a large health service network with 48,250 singleton pregnancies between January 2016 and June 2021. Supervised ML models were employed. Maternal clinical and medical characteristics were the feature variables (predictors), and a 70/30 data split was used for training and testing the model. Predictive performance was assessed using area under the curve (AUC) and calibration plots. Shapley value analysis assessed the contribution of feature variables.
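As a rough illustration of the evaluation steps named above, the sketch below performs a 70/30 split and computes the AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. It is not the authors' code: the function names, the random seed, and the toy labels and scores are assumptions made for this example, and the paper does not state how its split was drawn (for example, whether it was stratified).

```python
import random


def train_test_split_70_30(rows, seed=0):
    """Shuffle a copy of `rows` and return (train, test) in a 70/30 ratio.

    `seed` is a hypothetical parameter for reproducibility.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U statistic); tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data, not from the study: 1 = PE, 0 = no PE.
train, test = train_test_split_70_30(range(100))
print(len(train), len(test))                          # 70 30
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))       # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values should be read.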
Results
The random forest approach provided excellent discrimination with an AUC of 0.84 (95% CI: 0.82–0.86) and the highest prediction accuracy (0.79); however, the calibration curve (slope of 1.21, 95% CI 1.13–1.30) was acceptable only for a threshold of 0.3 or less. The next best approach was extreme gradient boosting, which provided an AUC of 0.77 (95% CI: 0.76–0.79) and was well calibrated (slope of 0.93, 95% CI 0.85–1.01). Logistic regression provided good discrimination performance with an AUC of 0.75 (95% CI: 0.74–0.76) and perfect calibration. Nulliparity, pre-pregnancy body mass index, previous pregnancy with prior PE, maternal age, family history of hypertension, and pre-existing hypertension and diabetes were the top-ranked features in Shapley value analysis.
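To make the calibration slopes above easier to interpret, the following sketch estimates a calibration intercept and slope by regressing the observed outcome on the logit of the predicted probability. This is one common definition and is assumed here; the abstract does not say how the authors computed their slopes. The function names, the Newton-Raphson settings, and the toy data are all hypothetical. Under this definition, a slope near 1 indicates that predicted risks are spread about as widely as the observed risks, a slope above 1 suggests predictions that are too moderate, and a slope below 1 suggests predictions that are too extreme.

```python
import math


def logit(p, eps=1e-12):
    """Log-odds of p, clipped away from 0 and 1 to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def calibration_slope(labels, probs, iterations=50, tol=1e-10):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson; return (a, b).

    a is the calibration intercept and b the calibration slope.
    """
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from the "perfectly calibrated" values
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the (negated) Hessian
        for xi, yi in zip(x, labels):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break  # degenerate design, e.g. all predictions identical
        # Newton step: solve the 2x2 system H * delta = g.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy data, not from the study: predicted risks and observed outcomes.
probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 0, 1, 0, 1]
intercept, slope = calibration_slope(labels, probs)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

A confidence interval such as those reported above would additionally require a standard error for the slope, for example from the inverse of the Hessian or from bootstrapping the test set; that step is omitted here.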
Conclusion
Two ML models achieved the highest predictive performance using routinely collected data to identify women at high risk of PE, with acceptable discrimination. However, to confirm this result and to examine model generalisability, external validation studies are needed in other settings, utilising standardised prognostic factors.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.