Predictive interpretable analytics models for forecasting healthcare costs using open healthcare data
A. Ravishankar Rao, Raunak Jain, Mrityunjai Singh, Rahul Garg
Healthcare analytics (New York, N.Y.), volume 6, Article 100351. Published 2024-06-26.
DOI: 10.1016/j.health.2024.100351
https://www.sciencedirect.com/science/article/pii/S2772442524000534
Citation count: 0
Abstract
Healthcare expenditure, a considerable proportion of national budgets, has risen rapidly. Consequently, considerable research is devoted to controlling healthcare costs. Many efforts are underway to improve medical price transparency. Price transparency will help patients become better informed, allowing them to shop for care they can afford, eventually leading to efficiency in healthcare markets. This first requires medical pricing data to be made publicly available. Since the raw pricing data can be large and cover multiple conditions, it is necessary to provide an engine that processes the data to facilitate its usage and understanding. We recommend creating computational models that predict healthcare costs for various patient conditions and demographics. Patients and providers can interrogate the underlying data to understand how healthcare costs vary with medical conditions and demographic variables of interest, including age. We demonstrate our approach by creating predictive models using recent machine learning techniques. We analyzed anonymous patient data from the New York State Statewide Planning and Research Cooperative System, consisting of 2.34 million records from 2019. We built models to predict costs from over two dozen patient variables, including diagnosis codes, severity of illness, age, and other demographic variables. We investigated three models: regression, decision trees, and random forests. These models are explainable. We analyzed features to determine those that were predictive of total costs. We determined that the diagnosis code, severity of illness, and length of stay were good predictors of total costs, whereas race and gender were not useful in predicting total costs. We obtained the best performance using a CatBoost regressor, which yielded an R² score of 0.85, better than the values reported in the literature.
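To make the reported metric concrete, the following is a minimal sketch of how a regression model's fit can be scored with R², the coefficient of determination. It is not the authors' pipeline: it fits ordinary least squares on a single predictor (length of stay) using only the Python standard library, and the six records, the function names `fit_simple_linear` and `r2_score`, and the variable names are all hypothetical illustrations rather than values from the SPARCS dataset.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("targets have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy records (assumption): length of stay in days and
# total cost in dollars. These are NOT taken from the SPARCS data.
length_of_stay = [1, 2, 3, 4, 5, 6]
total_cost = [2100.0, 3900.0, 6200.0, 7800.0, 10100.0, 11900.0]

intercept, slope = fit_simple_linear(length_of_stay, total_cost)
predicted = [intercept + slope * x for x in length_of_stay]
score = r2_score(total_cost, predicted)

print(f"cost ~ {intercept:.2f} + {slope:.2f} * length_of_stay")
print(f"R^2 on the toy data: {score:.4f}")
```

An R² of 1 means the predictions match the observed costs exactly, while a model that always predicts the mean cost scores 0. A realistic version would use many predictors, including categorical ones such as diagnosis code, and would score the model on held-out records rather than on the training data as this toy example does.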