Machine Learning-Based Prediction for High Health Care Utilizers by Using a Multi-Institutional Diabetes Registry: Model Training and Evaluation.

JMIR AI (Impact Factor: 2.0) | Pub Date: 2024-10-17 | DOI: 10.2196/58463

Joshua Kuan Tan, Le Quan, Nur Nasyitah Mohamed Salim, Jen Hong Tan, Su-Yen Goh, Julian Thumboo, Yong Mong Bee

JMIR AI, volume 3, e58463. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528163/pdf/


Abstract

Background: The cost of health care in many countries is increasing rapidly. There is a growing interest in using machine learning for predicting high health care utilizers for population health initiatives. Previous studies have focused on individuals who contribute to the highest financial burden. However, this group is small and represents a limited opportunity for long-term cost reduction.

Objective: We developed a collection of models that predict future health care utilization at various thresholds.

Methods: We utilized data from a multi-institutional diabetes database from the year 2019 to develop binary classification models. These models predict health care utilization in the subsequent year across 6 different outcomes: patients having a length of stay of ≥7, ≥14, and ≥30 days and emergency department attendance of ≥3, ≥5, and ≥10 visits. To address class imbalance, random and synthetic minority oversampling techniques were employed. The models were then applied to unseen data from 2020 and 2021 to predict health care utilization in the following year. A portfolio of performance metrics, with priority on area under the receiver operating characteristic curve, sensitivity, and positive predictive value, was used for comparison. Explainability analyses were conducted on the best performing models.
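As a rough illustration of the Methods, the sketch below derives the six binary outcomes from next-year utilization counts and applies random oversampling to a training set. It is a minimal sketch, not the authors' pipeline. The field names (`length_of_stay_days`, `ed_visits`), the outcome names, and the toy records are assumptions made for the example. Only the thresholds come from the abstract. Synthetic minority oversampling is not shown.

```python
import random
from collections import Counter

# Thresholds quoted in the Methods; keys and field names are hypothetical.
OUTCOME_THRESHOLDS = {
    "los_ge_7": ("length_of_stay_days", 7),
    "los_ge_14": ("length_of_stay_days", 14),
    "los_ge_30": ("length_of_stay_days", 30),
    "ed_ge_3": ("ed_visits", 3),
    "ed_ge_5": ("ed_visits", 5),
    "ed_ge_10": ("ed_visits", 10),
}


def label_outcomes(next_year):
    """Turn one patient's next-year utilization counts into six binary labels."""
    return {
        name: int(next_year[field] >= cutoff)
        for name, (field, cutoff) in OUTCOME_THRESHOLDS.items()
    }


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) < 2:
        # Only one class present: nothing to balance.
        return list(rows), list(labels)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    deficit = counts[majority] - counts[minority]
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * deficit


# Toy data (made up): [age, ED visits in the present year] per patient.
patient_next_year = {"length_of_stay_days": 15, "ed_visits": 4}
print(label_outcomes(patient_next_year))

train_rows = [[70, 1], [55, 0], [81, 6], [62, 0], [48, 0]]
train_labels = [0, 0, 1, 0, 0]
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

In a setup like this, resampling would normally be applied to the training data only, so that the 2020 and 2021 test data keep their natural class proportions. The abstract does not spell out this detail.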

Results: When trained with random oversampling, 4 models (logistic regression, multivariate adaptive regression splines, boosted trees, and multilayer perceptron) consistently achieved high area under the receiver operating characteristic curve (>0.80) and sensitivity (>0.60) across training-validation and test data sets. Correcting for class imbalance proved critical for model performance. Important predictors for all outcomes included age, number of emergency department visits in the present year, chronic kidney disease stage, inpatient bed days in the present year, and mean hemoglobin A1c levels. Explainability analyses using partial dependence plots demonstrated that for the best performing models, the learned patterns were consistent with real-world knowledge, thereby supporting the validity of the models.
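The three prioritized metrics can be computed from true labels and predicted scores as in the sketch below. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: the labels and scores are made up, the 0.5 decision cutoff is an assumption (the abstract does not state one), and the function names are hypothetical. The area under the receiver operating characteristic curve is computed here through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def sensitivity_and_ppv(y_true, y_pred):
    """Return (sensitivity, positive predictive value) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def auroc(y_true, scores):
    """AUROC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 2 high utilizers among 8 patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.60, 0.15, 0.70, 0.40]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 cutoff is an assumption

sens, ppv = sensitivity_and_ppv(y_true, y_pred)
print(f"AUROC={auroc(y_true, scores):.3f} sensitivity={sens:.2f} PPV={ppv:.2f}")
```

Positive predictive value depends on how common the outcome is, so it is most informative when computed on data that has not been resampled.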

Conclusions: We successfully developed machine learning models capable of predicting high service level utilization with strong performance and valid explainability. These models can be integrated into wider diabetes-related population health initiatives.
