Benchmarking AutoML frameworks for disease prediction using medical claims.

IF 4 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Biodata Mining | Pub Date: 2022-07-26 | DOI: 10.1186/s13040-022-00300-2
Roland Albert A Romero, Mariefel Nicole Y Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E Sheils, Elisabetta Manduchi, Jason H Moore
Citations: 8

Abstract


Benchmarking AutoML frameworks for disease prediction using medical claims.

Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.

Materials and methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.
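The feature construction described above (binary flags for disease codes in several look-back windows before the prediction year) might be sketched as follows. This is a minimal illustration, not the authors' pipeline: the window boundaries in `WINDOWS`, the index date, the function name `window_flags`, and the example codes are all assumptions, since the abstract does not specify them.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Assumption: outcomes are predicted for 2019, so look-back is measured from this date.
INDEX_DATE = date(2019, 1, 1)

# Hypothetical look-back windows, in days before INDEX_DATE: (label, min_days, max_days).
# The abstract mentions four windows but does not give their boundaries.
WINDOWS: List[Tuple[str, int, int]] = [
    ("0_90d", 1, 90),
    ("91_180d", 91, 180),
    ("181_365d", 181, 365),
    ("366_730d", 366, 730),
]


def window_flags(claims: Iterable[Tuple[date, str]], codes: List[str]) -> Dict[str, int]:
    """Return one 0/1 flag per (code, window) pair for a single member.

    claims: (service_date, diagnosis_code) pairs for that member.
    codes:  the diagnosis codes to turn into features.
    """
    flags = {f"{code}_{label}": 0 for code in codes for label, _, _ in WINDOWS}
    for service_date, code in claims:
        if code not in codes:
            continue
        days_before = (INDEX_DATE - service_date).days
        for label, lo, hi in WINDOWS:
            if lo <= days_before <= hi:
                flags[f"{code}_{label}"] = 1
    return flags


# Illustrative claims for one member; the codes are examples only.
example_claims = [
    (date(2018, 12, 1), "E11"),  # 31 days before the index date
    (date(2017, 6, 1), "I10"),   # 579 days before the index date
    (date(2019, 2, 1), "E11"),   # after the index date, so ignored
]
example_flags = window_flags(example_claims, ["E11", "I10"])
```

Claims dated on or after the index date fall outside every window, which keeps outcome-year information out of the features.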

Results: The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications.
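One common way to pick a threshold that balances true and false positive rates is to maximize Youden's J statistic (TPR minus FPR). The sketch below assumes that criterion for illustration; the abstract does not say which rule the authors used, and the function names and toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def rates_at_threshold(y_true: Sequence[int], scores: Sequence[float],
                       threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def best_threshold(y_true: Sequence[int],
                   scores: Sequence[float]) -> Tuple[Optional[float], float]:
    """Scan every observed score and keep the one maximizing J = TPR - FPR."""
    best_t: Optional[float] = None
    best_j = -1.0
    for t in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(y_true, scores, t)
        j = tpr - fpr
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy imbalanced example: 8 negatives, 2 positives (illustrative values only).
toy_labels: List[int] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_scores: List[float] = [0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.40, 0.60, 0.35, 0.70]
toy_threshold, toy_j = best_threshold(toy_labels, toy_scores)
```

In a real application the criterion would likely weight false positives and false negatives by their practical cost rather than equally, in line with the abstract's point that the threshold should follow the use case.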

Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance.
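The combination of resampling and ensembling mentioned above could look roughly like the following: train one base model per balanced bag (all positives plus an equally sized random sample of negatives) and average the predictions. This is only a sketch of the general idea under stated assumptions; the abstract does not describe a specific method, and `balanced_bags`, `fit_ensemble`, and the toy base learner `fit_flag_rate` are hypothetical names.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[int]
Model = Callable[[Row], float]


def balanced_bags(X: List[Row], y: List[int], n_bags: int,
                  seed: int = 0) -> List[Tuple[List[Row], List[int]]]:
    """Build bags containing every positive and an equal-sized sample of negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    bags = []
    for _ in range(n_bags):
        sampled_neg = rng.sample(neg, k=min(len(pos), len(neg)))
        idx = pos + sampled_neg
        bags.append(([X[i] for i in idx], [y[i] for i in idx]))
    return bags


def fit_flag_rate(Xb: List[Row], yb: List[int]) -> Model:
    """Toy base learner: positive rate among training rows sharing the first flag."""
    counts: Dict[int, Tuple[int, int]] = {}
    for row, label in zip(Xb, yb):
        n, k = counts.get(row[0], (0, 0))
        counts[row[0]] = (n + 1, k + label)
    overall = sum(yb) / len(yb)

    def model(row: Row) -> float:
        n, k = counts.get(row[0], (0, 0))
        return k / n if n else overall  # fall back to the bag-level rate

    return model


def fit_ensemble(X: List[Row], y: List[int],
                 fit: Callable[[List[Row], List[int]], Model],
                 n_bags: int = 5, seed: int = 0) -> Model:
    """Fit one base model per balanced bag and average their scores."""
    models = [fit(Xb, yb) for Xb, yb in balanced_bags(X, y, n_bags, seed)]

    def predict(row: Row) -> float:
        return sum(m(row) for m in models) / len(models)

    return predict


# Toy data: 4 positives (three with flag=1), 20 negatives (one with flag=1).
toy_X: List[Row] = [[1], [1], [1], [0]] + [[1]] + [[0]] * 19
toy_y: List[int] = [1, 1, 1, 1] + [0] * 20
toy_predict = fit_ensemble(toy_X, toy_y, fit_flag_rate)
```

Scores from models trained on balanced bags are inflated relative to the true prevalence, so they would need recalibration or a threshold chosen on untouched validation data before use.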

Conclusion: Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.

Source journal
Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)
CiteScore: 7.90
Self-citation rate: 0.00%
Articles published: 28
Review time: 23 weeks
Journal description: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.