An interpretable machine learning study for developing a binary classifier for predicting rehospitalization from skilled nursing facilities
Authors: Zhouyang Lou, Zachary Hass, Nan Kong
Journal: Healthcare analytics (New York, N.Y.), volume 7, Article 100387
Publication date: 2025-02-15
Publication type: Journal Article
DOI: 10.1016/j.health.2025.100387
URL: https://www.sciencedirect.com/science/article/pii/S2772442525000061
Citations: 0
Abstract
Reducing hospital readmissions for older adults discharged to a skilled nursing facility (SNF) is important to the United States (U.S.) from both financial and care-quality perspectives. To identify potential risk factors, researchers have used data from claims, national surveys, and administrative databases to train models that predict hospital readmissions occurring within 30 days of discharge. Machine learning techniques hold promise for this binary classification task. However, analysis pipelines remain underdeveloped in data balancing, feature selection, and model interpretability. In this paper, we used individual resident-level data from the Long-Term Care Minimum Data Set (MDS) collected from SNFs in a midwestern U.S. state (n = 93,058). We further triangulated these data with publicly available facility quality and staffing data from the Nursing Home Compare tool on Medicare.gov and with facility neighborhood data from the National Neighborhood Data Archive. We compared several machine learning models, data balancing techniques, and feature selection methods for the prediction task. We found that XGBoost, combined with the Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN) to balance the data and hierarchical clustering based on Spearman correlation to select features, produced the best prediction performance. We then used SHapley Additive exPlanations (SHAP) values to identify the features that contribute most to the performance, and used partial dependence plots to examine curvilinear and moderating relationships between features and the risk of 30-day rehospitalization.
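
To make the feature selection step more concrete, the following is a minimal sketch of hierarchical clustering on Spearman correlation, written with the Python standard library only. It is not the authors' implementation: the abstract does not state the linkage method, the distance definition, or the cut-off, so average linkage, the distance 1 - |rho|, and the threshold of 0.5 are assumptions. The function names and the toy feature values are hypothetical and are not drawn from the study data.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks (non-constant inputs only)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def cluster_features(columns, threshold=0.5):
    """Average-linkage agglomerative clustering on distance 1 - |rho|.

    The linkage, distance, and threshold are assumptions for illustration.
    """
    names = list(columns)
    dist = {
        frozenset((a, b)): 1 - abs(spearman(columns[a], columns[b]))
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it if close enough.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def select_representatives(columns, threshold=0.5):
    """Keep the first feature of each cluster as its representative."""
    return [cluster[0] for cluster in cluster_features(columns, threshold)]


# Hypothetical toy values, not study data.
toy = {
    "age": [70, 82, 65, 90, 77, 88],
    "age_squared": [4900, 6724, 4225, 8100, 5929, 7744],  # monotone in age
    "staff_hours": [3.1, 4.0, 3.9, 3.0, 4.4, 3.5],
}
selected = select_representatives(toy)
print(selected)
```

In this toy example, "age" and "age_squared" are perfectly rank-correlated and fall into one cluster, so only one of them is retained alongside "staff_hours". Keeping the first member of each cluster is the simplest choice; a real pipeline might instead keep the member with the strongest association with the outcome.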