MiniMedGPT: Efficient Large Vision–Language Model for medical Visual Question Answering
Authors: Abdel Rahman Alsabbagh, Tariq Mansour, Mohammad Al-Kharabsheh, Abdel Salam Ebdah, Roa’a Al-Emaryeen, Sara Al-Nahhas, Waleed Mahafza, Omar Al-Kadi
Journal: Pattern Recognition Letters, Volume 189, Pages 8-16
Publication date: 2025-03-01 (Epub 2025-01-08)
DOI: 10.1016/j.patrec.2025.01.001
URL: https://www.sciencedirect.com/science/article/pii/S0167865525000017
Impact factor: 3.3; JCR: Q2 (Computer Science, Artificial Intelligence)
While Large Vision–Language Models (LVLMs) like GPT-4 and Gemini demonstrate significant potential, their utilization in the medical domain remains largely unexplored. This is due to challenges attributed to prolonged training and language generation issues. Imbalances within medical Visual Question Answering (VQA) datasets further complicate the integration of LVLMs. In this paper, we present a novel approach named MiniMedGPT (Mini Medical Generative Pretrained Transformer). Inspired by MiniGPT4-v2, MiniMedGPT is specifically designed for efficient medical VQA. The framework of MiniMedGPT is built upon both medical and generic pretrained Large Language Models and features an end-to-end versatile fine-tuning pipeline that enables the alignment of medical VQA data in just 30 minutes within a single-stage framework. To address language generation shortcomings and dataset imbalances, we employ Gemini Vision Pro and MediCap as auxiliary components. Through comprehensive benchmarking and evaluations against 6 prominent medical VQA models across 2 well-known datasets, our approach achieves improved performance across various metrics while using the fewest trainable parameters among the compared models. This work can help train junior clinicians and has the potential to serve as a decision support tool for experienced radiologists.
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.