Developing Machine Learning Algorithms on Routinely Collected Administrative Health Data - Lessons from Ontario, Canada.

International Journal of Population Data Science (IF 2.2, Q3, Health Care Sciences & Services) | Pub Date: 2022-08-25 | DOI: 10.23889/ijpds.v7i3.1851

V. Harish, Mathieu Ravaut, S. Yi, Jahir M. Gutierrez, H. Sadeghi, Kin Kwan Leung, T. Watson, K. Kornas, T. Poutanen, M. Volkovs, L. Rosella



Abstract

There has been considerable growth in the development of machine learning models for clinical applications; however, less attention has been paid to applications at the health systems level. Here, we survey recent models developed using provincial administrative health data holdings in Ontario, Canada, to synthesize key learnings across use cases.

We developed four models in the areas of diabetes incidence and complications, hospitalization due to ambulatory care sensitive conditions, and hospitalization due to SARS-CoV-2 infection. Our team was highly multidisciplinary, with expertise across clinical medicine, administrative health data, epidemiology, and computer science. We used a “sliding window” approach to aggregate healthcare events chronologically across multiple health administrative datasets and map them dynamically onto a patient timeline. Tree-based algorithms, specifically gradient boosted decision trees, are well suited to the underlying tabular structure of administrative data and were used for each prediction task.

Our models achieved excellent discrimination, with the area under the receiver operating characteristic curve between 0.77 and 0.85 for prediction windows ranging from 30 days to 3 years in advance. They were also well calibrated, both in-the-large and in population subgroups such as older adults, those living in rural areas, and the materially deprived. Measures of feature importance revealed that our models were leveraging predictors across administrative datasets (e.g., demographics, interactions with the healthcare system, medications) in intuitive and defensible ways. Finally, we demonstrated the utility of our models with “recall at top k” metrics - for example, the top 1% of patients predicted to be at risk of diabetes complications represented a cost of over $400 million to the healthcare system.

We identify three key learnings needed for the successful application of machine learning methods to health administrative data: synergy between the nature of the training data and the intended algorithm use, adherence to methodological best practices for rigour and transparency, and multidisciplinary teams with expertise across data provenance, methodological approach, and impact assessment.
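The abstract does not define its “recall at top k” metric in detail. As an illustration only, the sketch below shows one common reading: rank patients by predicted risk, take the top k fraction, and report the share of the total outcome (event counts or cost) that falls in that group. The function name `recall_at_top_k`, its arguments, the rounding of the group size, and the toy data are all assumptions made for this example, not the authors' implementation. The abstract reports an absolute cost for the top 1%, whereas this sketch returns a share.

```python
import math


def recall_at_top_k(risk_scores, outcomes, k_fraction):
    """Share of the total outcome found in the top k fraction of patients.

    Hypothetical helper for illustration. `outcomes` may hold binary event
    flags (0/1) or a non-negative quantity such as healthcare cost.
    """
    if len(risk_scores) != len(outcomes):
        raise ValueError("risk_scores and outcomes must have the same length")
    if not 0 < k_fraction <= 1:
        raise ValueError("k_fraction must be in (0, 1]")
    total = sum(outcomes)
    if total == 0:
        raise ValueError("outcomes sum to zero, so recall is undefined")

    # Size of the top group: round up, with a small tolerance so that
    # floating-point noise (e.g. 0.1 * 30) does not add an extra patient.
    n_top = max(1, math.ceil(k_fraction * len(risk_scores) - 1e-9))

    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(risk_scores, outcomes), key=lambda p: p[0], reverse=True)

    # Sum the outcome over the top group and express it as a share of the total.
    captured = sum(outcome for _, outcome in ranked[:n_top])
    return captured / total


# Toy data (invented): ten patients, three of whom experience the event.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.5]
events = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
costs = [500, 0, 300, 0, 0, 0, 0, 200, 0, 0]

event_recall_top20 = recall_at_top_k(scores, events, 0.20)
cost_share_top10 = recall_at_top_k(scores, costs, 0.10)
```

With the toy data, the top 20% of patients by score contains two of the three events, and the single highest-risk patient accounts for half of the total cost. Passing costs instead of event flags gives a cost-weighted variant, which may be closer in spirit to the dollar figure quoted in the abstract, although the exact definition used there is not stated.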


