Predictive interpretable analytics models for forecasting healthcare costs using open healthcare data

A. Ravishankar Rao, Raunak Jain, Mrityunjai Singh, Rahul Garg
Healthcare Analytics (New York, N.Y.), journal article, published 2024-06-26. DOI: 10.1016/j.health.2024.100351
Full text: https://www.sciencedirect.com/science/article/pii/S2772442524000534

Abstract

Healthcare expenditure, a considerable proportion of national budgets, has risen rapidly. Consequently, considerable research is devoted to controlling healthcare costs. Many efforts are underway to improve medical price transparency. Price transparency will help patients become better informed, allowing them to shop for care they can afford, and should eventually lead to more efficient healthcare markets. This first requires medical pricing data to be made publicly available. Since the raw pricing data can be large and cover multiple conditions, an engine is needed to process the data and make it easier to use and understand. We recommend creating computational models that predict healthcare costs for various patient conditions and demographics. Patients and providers can interrogate the underlying data to understand how healthcare costs vary with medical conditions and demographic variables of interest, including age. We demonstrate our approach by creating predictive models using recent machine learning techniques. We analyzed anonymous patient data from the New York State Statewide Planning and Research Cooperative System, consisting of 2.34 million records from 2019. We built models to predict costs from over two dozen patient variables, including diagnosis codes, severity of illness, age, and other demographic variables. We investigated three models: regression, decision trees, and random forests. These models are explainable. We analyzed features to determine those that were predictive of total costs. We determined that the diagnosis code, severity of illness, and length of stay were good predictors of total costs, whereas race and gender were not useful in predicting total costs. We obtained the best performance using a CatBoost regressor, which yielded an R² score of 0.85, better than the values reported in the literature.
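
The workflow the abstract describes (fit several interpretable regressors, compare them by R² score, then inspect which features drive the prediction) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn, uses small synthetic data in place of the 2019 New York State records, and the feature names and the cost formula are hypothetical, chosen only so that length of stay and severity carry signal while gender does not. The CatBoost regressor mentioned in the abstract is left out to keep the example dependent on one library.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical feature names; the real dataset has over two dozen variables.
FEATURES = ["length_of_stay", "severity_of_illness", "age", "gender"]


def make_synthetic_records(n=2000, seed=0):
    """Generate toy inpatient records (assumption: not real patient data)."""
    rng = np.random.default_rng(seed)
    length_of_stay = rng.integers(1, 15, size=n)  # days, 1..14
    severity = rng.integers(1, 5, size=n)         # ordinal level, 1..4
    age = rng.integers(18, 90, size=n)
    gender = rng.integers(0, 2, size=n)           # binary-encoded
    X = np.column_stack([length_of_stay, severity, age, gender]).astype(float)
    # Assumed cost model: driven by stay length and severity plus noise.
    # Age and gender carry no signal here by construction.
    y = 1500.0 * length_of_stay * severity + rng.normal(0.0, 2000.0, size=n)
    return X, y


def compare_models(X, y, seed=0):
    """Fit three regressors and return test-set R^2 plus forest importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "linear_regression": LinearRegression(),
        "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    # Impurity-based importances from the fitted forest, keyed by feature name.
    importances = dict(zip(FEATURES, models["random_forest"].feature_importances_))
    return scores, importances


scores, importances = compare_models(*make_synthetic_records())
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
for feature, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: importance = {weight:.3f}")
```

On this toy data the importance ranking simply reflects how the synthetic costs were generated, so it says nothing about real healthcare costs; the point is only the shape of the comparison, with held-out R² for model quality and feature importances for interpretability.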
