Background: The volume of healthcare data is expanding rapidly, presenting both challenges and opportunities. Traditional statistical methods used in epidemiology, such as logistic regression (LR), are widely applied but have limited ability to handle the complexity and high dimensionality of modern datasets. In contrast, machine learning (ML) methods can model complex, non-linear relationships and are less constrained by parametric assumptions, making them well suited to uncovering hidden patterns.
Methods: We introduce ML applications for epidemiologic research and compare three predictive models: LR as a traditional modeling approach, and least absolute shrinkage and selection operator (LASSO) regression and eXtreme Gradient Boosting (XGBoost) as ML approaches. We demonstrate how ML approaches, particularly XGBoost, can benefit epidemiologic research through a real-world case study. We walk through the common steps of data preprocessing, model building, and model evaluation. Additionally, we address the "black box" nature of ML models and present post hoc explanation tools to enhance interpretability.
Results: We examined the prediction of near-centenarianism (reaching an age of 95 years or older) from midlife predictors (i.e., demographic, clinical, lifestyle, occupational, and dietary variables) in a cohort of approximately 10,000 middle-aged working men recruited in 1963 and followed until death or until 2019. Models were fitted and calibrated on a training set and showed good predictive performance on a separate test set. XGBoost, LASSO regression, and LR achieved ROC-AUC values of 0.72 (95% CI: 0.66-0.75), 0.71 (95% CI: 0.67-0.74), and 0.69 (95% CI: 0.66-0.73), respectively. Explainability analysis identified key predictors of longevity, including systolic blood pressure, smoking status, and a history of myocardial infarction, consistent with prior studies.
Conclusions: Our findings highlight the potential of ML to enhance epidemiological studies by handling complex interactions and high-dimensional data, suggesting a complementary approach to traditional methods.
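The ROC-AUC values with 95% confidence intervals reported in the Results above can be illustrated with a small, dependency-free sketch. It computes ROC-AUC from ranks and attaches a percentile bootstrap interval. The abstract does not state how its intervals were obtained, so the percentile bootstrap, the function names (`roc_auc`, `bootstrap_auc_ci`), and the synthetic data are all assumptions made for illustration only.

```python
import random


def roc_auc(y_true, y_score):
    """ROC-AUC via the rank-sum identity: the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as one half)."""
    n = len(y_score)
    order = sorted(range(n), key=lambda i: y_score[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < n and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval (an assumed
    method, not necessarily the one used in the study)."""
    point = roc_auc(y_true, y_score)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples containing a single class
            stats.append(roc_auc(ys, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Synthetic test-set labels and predicted risks (hypothetical, not study data).
_rng = random.Random(42)
labels = [1 if _rng.random() < 0.2 else 0 for _ in range(300)]
scores = [0.3 * y + _rng.random() for y in labels]
auc, ci_low, ci_high = bootstrap_auc_ci(labels, scores, n_boot=500)
print(f"ROC-AUC {auc:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
```

In practice the scores would be the predicted probabilities of each fitted model on the held-out test set, and the same resampled indices could be reused across models to compare them on identical bootstrap samples.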
Purpose: To explore disparities in cervical cancer diagnosis and outcomes for Asian patients and Native Hawaiian and other Pacific Islanders (NHPIs).
Methods: We extracted data on patients with cervical cancer from the Surveillance, Epidemiology, and End Results (SEER) 17 database. Odds ratios (ORs) for stage at diagnosis and time ratios (TRs) for survival outcomes were estimated using logistic regression and accelerated failure time models, respectively.
Results: Of 18,770 patients, 15,847 (84.4%) were White; 2,618 (13.9%) were Asian; and 305 (1.6%) were NHPI. NHPI patients were less likely than White patients to be diagnosed at an early stage (adjusted OR [aOR]: 0.60; 95% CI, 0.47-0.77), whereas Asian patients had a similar stage at diagnosis to White patients (aOR: 0.93; 95% CI, 0.85-1.02). Asian patients, as a group, had significantly longer overall survival (OS) (adjusted TR [aTR]: 1.46; 95% CI, 1.33-1.61) and disease-specific survival (DSS) (aTR: 1.35; 95% CI, 1.21-1.51) than White patients; the opposite was true for NHPIs (OS: aTR, 0.80; 95% CI, 0.64-1.00; DSS: aTR, 0.75; 95% CI, 0.59-0.97).
Conclusions: We find that NHPI cervical cancer patients tend to be diagnosed later in their disease course than White patients and have shorter survival time post-diagnosis, while Asian patients tend to have longer survival time. These findings support the disaggregation of Asian and NHPI races in cervical cancer investigations.
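As a reading aid for the odds ratios and time ratios above: both are exponentiated model coefficients, so an aTR of 1.46 corresponds to survival times roughly 46% longer than in the reference group, and an aOR of 0.60 to roughly 40% lower odds of early-stage diagnosis. The sketch below converts a reported ratio and its interval back to the log scale. It assumes the intervals are Wald-type (symmetric on the log scale), which the abstract does not state, and the helper names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_scale_summary(ratio, ci_low, ci_high, z=Z_95):
    """Recover the log-scale coefficient and an approximate standard error
    from a ratio and its CI, assuming a Wald interval (an assumption)."""
    beta = math.log(ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def percent_change(ratio):
    """Express a ratio as a percent difference from the reference group."""
    return (ratio - 1) * 100


# Ratios quoted in the Results above, with White patients as the reference.
reported = {
    "Asian OS (aTR)": (1.46, 1.33, 1.61),
    "NHPI OS (aTR)": (0.80, 0.64, 1.00),
    "NHPI early stage (aOR)": (0.60, 0.47, 0.77),
}
for label, (ratio, low, high) in reported.items():
    beta, se = log_scale_summary(ratio, low, high)
    print(f"{label}: beta={beta:.3f}, se={se:.3f}, "
          f"change={percent_change(ratio):+.0f}%")
```

For a time ratio the percent change applies to survival time; for an odds ratio it applies to the odds, not to the probability, of the outcome.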

