Thien Vu, Research Dawadi, Masaki Yamamoto, Jie Ting Tay, Naoki Watanabe, Yuki Kuriya, Ai Oya, Phap Ngoc Hoang Tran, Michihiro Araki
{"title":"Prediction of depressive disorder using machine learning approaches: findings from the NHANES.","authors":"Thien Vu, Research Dawadi, Masaki Yamamoto, Jie Ting Tay, Naoki Watanabe, Yuki Kuriya, Ai Oya, Phap Ngoc Hoang Tran, Michihiro Araki","doi":"10.1186/s12911-025-02903-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Depressive disorder, particularly major depressive disorder (MDD), significantly impact individuals and society. Traditional analysis methods often suffer from subjectivity and may not capture complex, non-linear relationships between risk factors. Machine learning (ML) offers a data-driven approach to predict and diagnose depression more accurately by analyzing large and complex datasets.</p><p><strong>Methods: </strong>This study utilized data from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 to predict depression using six supervised ML models: Logistic Regression, Random Forest, Naive Bayes, Support Vector Machine (SVM), Extreme Gradient Boost (XGBoost), and Light Gradient Boosting Machine (LightGBM). Depression was assessed using the Patient Health Questionnaire (PHQ-9), with a score of 10 or higher indicating moderate to severe depression. The dataset was split into training and testing sets (80% and 20%, respectively), and model performance was evaluated using accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP (SHapley Additive exPlanations) values were used to identify the critical risk factors and interpret the contributions of each feature to the prediction.</p><p><strong>Results: </strong>XGBoost was identified as the best-performing model, achieving the highest accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP analysis highlighted the most significant predictors of depression: the ratio family income to poverty (PIR), sex, hypertension, serum cotinine and hydroxycotine, BMI, education level, glucose levels, age, marital status, and renal function (eGFR).</p><p><strong>Conclusion: </strong>We developed ML models to predict depression and utilized SHAP for interpretation. This approach identifies key factors associated with depression, encompassing socioeconomic, demographic, and health-related aspects.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"25 1","pages":"83"},"PeriodicalIF":3.3000,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-025-02903-1","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Depressive disorder, particularly major depressive disorder (MDD), significantly impact individuals and society. Traditional analysis methods often suffer from subjectivity and may not capture complex, non-linear relationships between risk factors. Machine learning (ML) offers a data-driven approach to predict and diagnose depression more accurately by analyzing large and complex datasets.
Methods: This study utilized data from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 to predict depression using six supervised ML models: Logistic Regression, Random Forest, Naive Bayes, Support Vector Machine (SVM), Extreme Gradient Boost (XGBoost), and Light Gradient Boosting Machine (LightGBM). Depression was assessed using the Patient Health Questionnaire (PHQ-9), with a score of 10 or higher indicating moderate to severe depression. The dataset was split into training and testing sets (80% and 20%, respectively), and model performance was evaluated using accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP (SHapley Additive exPlanations) values were used to identify the critical risk factors and interpret the contributions of each feature to the prediction.
Results: XGBoost was identified as the best-performing model, achieving the highest accuracy, sensitivity, specificity, precision, AUC, and F1 score. SHAP analysis highlighted the most significant predictors of depression: the ratio family income to poverty (PIR), sex, hypertension, serum cotinine and hydroxycotine, BMI, education level, glucose levels, age, marital status, and renal function (eGFR).
Conclusion: We developed ML models to predict depression and utilized SHAP for interpretation. This approach identifies key factors associated with depression, encompassing socioeconomic, demographic, and health-related aspects.
期刊介绍:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.