Veridical Data Science for Medical Foundation Models

arXiv - STAT - Machine Learning. Pub Date: 2024-09-15. DOI: arxiv-2409.10580

Ahmed Alaa, Bin Yu



Abstract

The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.


