Prediction of pre-eclampsia with machine learning approaches: Leveraging important information from routinely collected data
Sofonyas Abebaw Tiruneh, Daniel Lorber Rolnik, Helena J. Teede, Joanne Enticott
International Journal of Medical Informatics (impact factor 3.7; JCR Q2, Computer Science, Information Systems). Published 2024-10-05. DOI: 10.1016/j.ijmedinf.2024.105645
https://www.sciencedirect.com/science/article/pii/S1386505624003083
Citations: 0
Abstract
Background
Globally, pre-eclampsia (PE) is a leading cause of maternal and perinatal morbidity and mortality. PE prediction using routinely collected data has the advantage of being widely applicable, particularly in low-resource settings. Early intervention for high-risk women might reduce PE incidence and related complications. We aimed to replicate our published machine learning (ML) work on predicting another maternal condition (gestational diabetes) to (1) predict PE using routine health data, (2) identify the optimal ML model, and (3) compare it with a logistic regression approach.
Methods
Data were from a large health service network with 48,250 singleton pregnancies between January 2016 and June 2021. Supervised ML models were employed. Maternal clinical and medical characteristics were the feature variables (predictors), and a 70/30 data split was used for training and testing the model. Predictive performance was assessed using area under the curve (AUC) and calibration plots. Shapley value analysis assessed the contribution of feature variables.
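As a rough illustration of the evaluation setup described above, the following minimal sketch shows a 70/30 split and a rank-based AUC in plain Python. The function names, the fixed random seed, and the toy labels and scores are assumptions made for illustration; they are not the authors' code or data.

```python
import random


def train_test_split_70_30(rows, seed=0):
    """Shuffle a copy of `rows` and split it 70% train / 30% test.

    The seed is a hypothetical choice so the split is reproducible.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = developed PE, 0 = did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; on a cohort of this size a sort-based implementation or a library routine would be the practical choice.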
Results
The random forest approach provided excellent discrimination with an AUC of 0.84 (95% CI: 0.82–0.86) and the highest prediction accuracy (0.79); however, the calibration curve (slope of 1.21, 95% CI 1.13–1.30) was acceptable only for a threshold of 0.3 or less. The next best approach was extreme gradient boosting, which provided an AUC of 0.77 (95% CI: 0.76–0.79) and was well calibrated (slope of 0.93, 95% CI 0.85–1.01). Logistic regression provided good discrimination performance with an AUC of 0.75 (95% CI: 0.74–0.76) and perfect calibration. Nulliparity, pre-pregnancy body mass index, previous pregnancy with PE, maternal age, family history of hypertension, and pre-existing hypertension and diabetes were the top-ranked features in Shapley value analysis.
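The calibration slopes reported above can be read as follows: a slope of 1 means predicted risks match observed risks on the log-odds scale, a slope below 1 suggests predictions that are too extreme, and a slope above 1 suggests predictions that are too moderate. The sketch below estimates a calibration slope by regressing the observed outcome on the logit of the predicted probability, fitted with Newton-Raphson. It is one common way to obtain such a slope, offered under the assumption that a logistic recalibration model is meant; the abstract does not state the exact procedure, and all names and toy data are hypothetical.

```python
import math


def logit(p, eps=1e-9):
    """Log-odds of p, clipped away from 0 and 1 to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def calibration_slope(labels, probs, iters=50, tol=1e-10):
    """Fit y ~ intercept + slope * logit(p) by Newton-Raphson.

    Returns (intercept, slope). Assumes the data are not perfectly
    separated, otherwise the maximum-likelihood fit does not exist.
    """
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from "perfectly calibrated"
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, labels):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g_a += yi - mu            # gradient w.r.t. intercept
            g_b += (yi - mu) * xi     # gradient w.r.t. slope
            h_aa += w                 # entries of the information matrix
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0.0:
            break
        # Newton step: solve the 2x2 system H * delta = g
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy data (made up): predictions of 0.1 and 0.9, but the observed event
# rates are 0.2 and 0.8, so the predictions are too extreme.
toy_probs = [0.1] * 10 + [0.9] * 10
toy_labels = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
intercept, slope = calibration_slope(toy_labels, toy_probs)
print(intercept, slope)  # slope comes out below 1 for over-extreme predictions
```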
Conclusion
Two ML models provided the highest-performing predictions using routinely collected data to identify women at high risk of PE, with acceptable discrimination. However, to confirm this result and examine model generalisability, external validation studies are needed in other settings, utilising standardised prognostic factors.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.