Veridical Data Science for Medical Foundation Models
Ahmed Alaa, Bin Yu
arXiv - STAT - Machine Learning, 2024-09-15
DOI: arxiv-2409.10580 (https://doi.org/arxiv-2409.10580)
Citations: 0
Abstract
The advent of foundation models (FMs) such as large language models (LLMs)
has led to a cultural shift in data science, both in medicine and beyond. This
shift involves moving away from specialized predictive models trained for
specific, well-defined domain questions to generalist FMs pre-trained on vast
amounts of unstructured data, which can then be adapted to various clinical
tasks and questions. As a result, the standard data science workflow in
medicine has been fundamentally altered; the foundation model lifecycle (FMLC)
now includes distinct upstream and downstream processes, in which computational
resources, model and data access, and decision-making power are distributed
among multiple stakeholders. At their core, FMs are statistical
models, and this new workflow challenges the principles of Veridical Data
Science (VDS), hindering the rigorous statistical analysis expected in
transparent and scientifically reproducible data science practices. We
critically examine the medical FMLC in light of the core principles of VDS:
predictability, computability, and stability (PCS), and explain how it deviates
from the standard data science workflow. Finally, we propose recommendations
for a reimagined medical FMLC that expands and refines the PCS principles for
VDS, taking into account the computational and accessibility constraints
inherent to FMs.