Objective
This study aims to bridge the gap between predictive modeling and causal inference by using lifestyle data from the National Health and Nutrition Examination Survey (NHANES) to compare the predictive performance of multiple machine learning models for coronary heart disease (CHD). By incorporating Mendelian randomization, it seeks to identify and validate lifestyle variables that have both predictive power and a causal effect on CHD.
Methods
We extracted variables related to demographic characteristics and lifestyle from the NHANES database (2013–2018; n = 29,400). Recursive feature elimination (RFE) was employed to rank variable importance and determine the optimal feature subset. Subsequently, eight machine learning models were developed for CHD prediction: Support Vector Machine (SVM), Neural Network (NN), Naive Bayes (NB), Extreme Gradient Boosting (XGBoost), Multilayer Perceptron (MLP), Generalized Linear Model (GLM), Adaptive Boosting (AdaBoost), and Decision Tree (DT). Model performance was evaluated using metrics including accuracy, precision, sensitivity (recall), specificity, F1-score, and the Receiver Operating Characteristic (ROC) curve, with variable contributions visualized using Shapley Additive Explanations (SHAP). Additionally, Mendelian randomization (MR) was applied to distinguish associative from causal relationships, validating top predictors via genetic instruments derived from Genome-Wide Association Studies (GWAS).
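The evaluation metrics named above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the study's code: the function names `classification_metrics` and `roc_auc` and the toy labels and scores are assumptions introduced here. It also makes explicit that sensitivity and recall are the same quantity, and computes the area under the ROC curve (AUC) as the probability that a randomly chosen case is scored above a randomly chosen non-case.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = CHD, 0 = no CHD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # identical to recall
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (case, non-case) pairs ranked correctly.

    Ties count as half a correct ranking (Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 0, 0]
toy_predictions = [1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_metrics = classification_metrics(toy_labels, toy_predictions)
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise AUC is quadratic in sample size, which is fine for a sketch; on a dataset the size of NHANES a rank-based implementation from an established library would be the practical choice.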
Results
RFE identified age, sex, fasting blood glucose, body mass index (BMI), total cholesterol (TC) intake, sleep duration, diastolic blood pressure, and smoking as the most significant predictors of CHD. SVM outperformed DT, AdaBoost, XGBoost, NN, MLP, NB, and GLM, achieving an accuracy of 83.4% and an area under the ROC curve (AUC) of 0.909, demonstrating clinically actionable predictive power. MR confirmed causal associations for five variables: BMI (OR: 1.01, P < 0.001), TC (OR: 1.01, P < 0.001), insomnia (OR: 1.03, P < 0.001), diastolic blood pressure (OR: 1.20, P < 0.001), and smoking (OR: 1.03, P < 0.001), while fasting glucose showed no evidence of a causal effect (P > 0.05).
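To show how an MR odds ratio of this kind is typically derived from GWAS summary statistics, the following is a minimal sketch of the inverse-variance weighted (IVW) estimator. The abstract does not state which MR estimator was used, so IVW is an assumption here, and the function name `ivw_odds_ratio` and all numbers are hypothetical. Each genetic variant contributes an effect on the exposure (`beta_exposure`) and a log-odds effect on CHD (`beta_outcome`) with its standard error (`se_outcome`).

```python
import math


def ivw_odds_ratio(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal log-odds, returned as an OR.

    Returns (odds_ratio, ci_lower, ci_upper) with a 95% confidence interval.
    Assumes independent variants and valid instruments.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    # Weighted regression of outcome effects on exposure effects
    # through the origin, with weights 1 / se_outcome**2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("exposure effects are all zero")

    log_or = numerator / denominator
    se_log_or = math.sqrt(1.0 / denominator)
    return (
        math.exp(log_or),
        math.exp(log_or - 1.96 * se_log_or),
        math.exp(log_or + 1.96 * se_log_or),
    )


# Hypothetical summary statistics for two variants, for illustration only.
example_or, example_low, example_high = ivw_odds_ratio(
    beta_exposure=[0.1, 0.2],
    beta_outcome=[0.02, 0.04],
    se_outcome=[0.01, 0.01],
)
```

An odds ratio above 1 whose confidence interval excludes 1 is read as evidence that a genetically predicted increase in the exposure raises CHD risk; an interval spanning 1 corresponds to a null result such as the one reported for fasting glucose.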
Conclusion
The SVM machine learning model, based on NHANES data, enables faster and more efficient prediction of CHD. The study identified age, sex, BMI, TC intake, sleep duration, diastolic blood pressure, and smoking as the demographic and lifestyle variables with the greatest impact on CHD. This dual approach advances precision prevention by combining predictive accuracy with genetic evidence.
