Prediction model for type 2 diabetes mellitus and its association with mortality using machine learning in three independent cohorts from South Korea, Japan, and the UK: a model development and validation study.
Hayeon Lee, Seung Ha Hwang, Seoyoung Park, Yunjeong Choi, Sooji Lee, Jaeyu Park, Yejun Son, Hyeon Jin Kim, Soeun Kim, Jiyeon Oh, Lee Smith, Damiano Pizzol, Sang Youl Rhee, Hyunji Sang, Jinseok Lee, Dong Keon Yon
EClinicalMedicine, volume 80, 103069. Published 2025-01-18. DOI: 10.1016/j.eclinm.2025.103069 (https://doi.org/10.1016/j.eclinm.2025.103069). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11787438/pdf/
Abstract
Background: Type 2 diabetes mellitus (T2DM) is a significant global public health concern that has grown steadily over the past few decades. This study aimed to predict the incidence of T2DM within 5 years and the risk of mortality following the onset of T2DM, using data from three independent cohorts worldwide.
Methods: We used data from three independent, large-scale, general population-based cohort studies. The Korean cohort (NHIS-NSC cohort; discovery cohort; n = 973,303), conducted between 1 January 2002 and 31 December 2013, was used for training and internal validation, whereas the Japanese cohort (JMDC cohort; validation cohort A; n = 12,143,715) and the UK cohort (UK Biobank; validation cohort B; n = 416,656) were used for external validation. We employed various machine learning (ML)-based models, using 18 features, to predict the incidence of T2DM within 5 years of regular health checkups, and calculated SHapley Additive exPlanations (SHAP) values. To assess the robustness of our ML-based prediction model, we investigated the potential association between the model probability, divided into tertiles, and the risk of mortality following the onset of T2DM.
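The tertile step described in the Methods can be illustrated with a minimal sketch. The abstract does not state how the cut points were chosen, so this assumes they are the empirical one-third and two-thirds quantiles of the predicted probabilities; the function name `assign_tertiles` and the example values are hypothetical and are not taken from the study.

```python
from statistics import quantiles


def assign_tertiles(probabilities):
    """Label each predicted probability as T1 (lowest third), T2, or T3 (highest third)."""
    # Assumption: cut points are the empirical tertile boundaries of the
    # predicted probabilities themselves; the study may have used another rule.
    low_cut, high_cut = quantiles(probabilities, n=3, method="inclusive")
    labels = []
    for p in probabilities:
        if p <= low_cut:
            labels.append("T1")
        elif p <= high_cut:
            labels.append("T2")
        else:
            labels.append("T3")
    return labels


# Hypothetical predicted 5-year T2DM probabilities for nine people.
example_probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.40, 0.70, 0.80, 0.90]
example_labels = assign_tertiles(example_probs)
```

In a survival analysis, the resulting tertile label would then enter the model as a categorical exposure; that step is not shown here.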
Findings: In the discovery cohort, the ensemble model using voting with logistic regression and adaptive boosting achieved a balanced accuracy of 72.6% and an area under the receiver operating characteristic curve (AUROC) of 0.792. The SHAP value analysis of our proposed model revealed that age was the most important predictor of incident T2DM, followed by fasting blood glucose, hemoglobin, γ-glutamyl transferase level, and body mass index. The model probability was associated with an increased risk of mortality (T1: adjusted hazard ratio, 2.82 [95% CI, 2.01-3.94]; T2: 3.89 [2.74-5.53]; and T3: 7.73 [5.37-11.12]). Similar patterns and trends were observed in the validation cohorts (T1: 1.74 [1.49-2.03], T2: 1.97 [1.69-2.30], and T3: 3.31 [2.82-3.38] in validation cohort A; T1: 1.33 [1.03-1.71], T2: 1.54 [1.21-1.96], and T3: 1.73 [1.36-2.20] in validation cohort B).
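For readers unfamiliar with the two metrics reported in the Findings, the following sketch shows how balanced accuracy and AUROC can be computed from predicted probabilities. It is illustrative only: the abstract does not say whether the voting was soft or hard, so averaging the two models' probabilities (soft voting) is an assumption, as is the 0.5 decision threshold. All function names and the toy data are hypothetical.

```python
def soft_vote(probs_a, probs_b):
    """Average two models' predicted probabilities (soft voting; an assumption)."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def balanced_accuracy(y_true, y_prob, threshold=0.5):
    """Mean of sensitivity and specificity at the given threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def auroc(y_true, y_prob):
    """Probability that a random case scores above a random non-case (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not from the study: 1 = developed T2DM within 5 years.
y_true = [0, 0, 0, 0, 1, 1, 1]
lr_probs = [0.10, 0.20, 0.70, 0.30, 0.40, 0.90, 0.50]   # hypothetical logistic regression output
ada_probs = [0.20, 0.40, 0.50, 0.10, 0.70, 0.80, 0.40]  # hypothetical adaptive boosting output

ensemble_probs = soft_vote(lr_probs, ada_probs)
print(f"balanced accuracy: {balanced_accuracy(y_true, ensemble_probs):.3f}")
print(f"AUROC: {auroc(y_true, ensemble_probs):.3f}")
```

Balanced accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the two are often reported together for imbalanced outcomes.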
Interpretation: This study derived and validated an ML-based model to predict the incidence of T2DM within 5 years across three countries (South Korea, Japan, and the UK), showing that the model probability is associated with an increased risk of mortality.
Funding: Institute of Information & Communications Technology Planning & Evaluation, South Korea.
About the journal:
eClinicalMedicine is a gold open-access clinical journal designed to support frontline health professionals in addressing the complex and rapid health transitions affecting societies globally. The journal aims to assist practitioners in overcoming healthcare challenges across diverse communities, spanning diagnosis, treatment, prevention, and health promotion. Integrating disciplines from various specialties and life stages, it seeks to enhance health systems as fundamental institutions within societies. With a forward-thinking approach, eClinicalMedicine aims to redefine the future of healthcare.