Veridical Data Science for Medical Foundation Models

Ahmed Alaa, Bin Yu
arXiv - STAT - Machine Learning. Published 2024-09-15. DOI: arxiv-2409.10580 (https://doi.org/arxiv-2409.10580)

Abstract

The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS, including consideration of the computational and accessibility constraints inherent to FMs.