{"title":"Visual–language foundation models in medicine","authors":"Chunyu Liu, Yixiao Jin, Zhouyu Guan, Tingyao Li, Yiming Qin, Bo Qian, Zehua Jiang, Yilan Wu, Xiangning Wang, Ying Feng Zheng, Dian Zeng","doi":"10.1007/s00371-024-03579-w","DOIUrl":null,"url":null,"abstract":"<p>By integrating visual and linguistic understanding, visual–language foundation models (VLFMs) have the great potential to advance the interpretation of medical data, thereby enhancing diagnostic precision, treatment planning, and patient management. We reviewed the developmental strategies of VLFMs, detailing the pretraining strategies, and subsequent application across various healthcare facets. The challenges inherent to VLFMs are described, including safeguarding data privacy amidst sensitive medical data usage, ensuring algorithmic transparency, and fostering explainability for trust in clinical decision-making. We underscored the significance of VLFMs in addressing the complexity of multimodal medical data, from visual to textual, and their potential in tasks like image-based disease diagnosis, medicine report synthesis, and longitudinal patient monitoring. It also examines the progress in VLFMs like Med-Flamingo, LLaVA-Med, and their zero-shot learning capabilities, and the exploration of parameter-efficient fine-tuning methods for efficient adaptation. This review concludes by encouraging the community to pursue these emergent and promising directions to strengthen the impact of artificial intelligence and deep learning on healthcare delivery and research.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Visual Computer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00371-024-03579-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
By integrating visual and linguistic understanding, visual–language foundation models (VLFMs) have great potential to advance the interpretation of medical data, thereby improving diagnostic precision, treatment planning, and patient management. We review the developmental strategies of VLFMs, detailing their pretraining strategies and their subsequent applications across various facets of healthcare. We describe the challenges inherent to VLFMs, including safeguarding data privacy when handling sensitive medical data, ensuring algorithmic transparency, and fostering explainability to build trust in clinical decision-making. We underscore the significance of VLFMs in addressing the complexity of multimodal medical data, from visual to textual, and their potential in tasks such as image-based disease diagnosis, medical report synthesis, and longitudinal patient monitoring. We also examine the progress of VLFMs such as Med-Flamingo and LLaVA-Med, their zero-shot learning capabilities, and the exploration of parameter-efficient fine-tuning methods for efficient adaptation. The review concludes by encouraging the community to pursue these emergent and promising directions to strengthen the impact of artificial intelligence and deep learning on healthcare delivery and research.
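
The abstract points to parameter-efficient fine-tuning as a route to adapting large VLFMs at low cost. As a minimal sketch of one such method (not taken from the paper), the snippet below implements a LoRA-style adapter in PyTorch: the pretrained weights are frozen and only two small low-rank matrices are trained. The layer names, the rank, and the 768-to-4096 projection dimensions are illustrative assumptions, not details of Med-Flamingo or LLaVA-Med.

```python
# Minimal sketch of parameter-efficient fine-tuning via low-rank adaptation (LoRA).
# The module and dimension choices are illustrative; the key idea is to freeze the
# pretrained weights and train only a small low-rank update.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)        # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pretrained path plus the scaled low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Hypothetical usage: wrap a projection that maps image features (768-d) into a
# language model's embedding space (4096-d), so only the LoRA matrices are trained.
proj = LoRALinear(nn.Linear(768, 4096), rank=8)
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
total = sum(p.numel() for p in proj.parameters())
print(f"trainable params: {trainable} / {total}")
```

Because only the low-rank matrices receive gradients, the trainable parameter count is a small fraction of the full projection, which is the main appeal of such methods for adapting large vision–language models to medical tasks.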