Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence.
{"title":"Assessment of large language models in medical quizzes for clinical chemistry and laboratory management: implications and applications for healthcare artificial intelligence.","authors":"Won Young Heo, Hyung-Doo Park","doi":"10.1080/00365513.2025.2466054","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) have demonstrated high performance across various fields due to their ability to understand, generate, and manipulate human language. However, their potential in specialized medical domains, such as clinical chemistry and laboratory management, remains underexplored. This study evaluated the performance of nine LLMs using zero-shot prompting on 109 clinical problem-based quizzes from peer-reviewed journal articles in the Laboratory Medicine Online (LMO) database. These quizzes covered topics in clinical chemistry, toxicology, and laboratory management. The models, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, along with their earlier or smaller versions, were assigned roles as clinical chemists or laboratory managers to simulate real-world decision-making scenarios. Among the evaluated models, GPT-4o achieved the highest overall accuracy, correctly answering 81.7% of the quizzes, followed by GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), and Gemini 1.5 Pro (69.7%), while the lowest performance was observed with Gemini 1.0 Pro (51.4%). GPT-4o performed exceptionally well across all quiz types, including single-select, open-ended, and multiple-select questions, and demonstrated particular strength in quizzes involving figures, tables, or calculations. These findings highlight the ability of LLMs to effectively apply their pre-existing knowledge base to specialized clinical chemistry inquiries without additional fine-tuning. 
Among the evaluated models, GPT-4o exhibited superior performance across different quiz types, underscoring its potential utility in assisting healthcare professionals in clinical decision-making.</p>","PeriodicalId":21474,"journal":{"name":"Scandinavian Journal of Clinical & Laboratory Investigation","volume":" ","pages":"1-8"},"PeriodicalIF":1.3000,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scandinavian Journal of Clinical & Laboratory Investigation","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/00365513.2025.2466054","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MEDICINE, RESEARCH & EXPERIMENTAL","Score":null,"Total":0}
Citations: 0
Abstract
Large language models (LLMs) have demonstrated high performance across various fields due to their ability to understand, generate, and manipulate human language. However, their potential in specialized medical domains, such as clinical chemistry and laboratory management, remains underexplored. This study evaluated the performance of nine LLMs using zero-shot prompting on 109 clinical problem-based quizzes from peer-reviewed journal articles in the Laboratory Medicine Online (LMO) database. These quizzes covered topics in clinical chemistry, toxicology, and laboratory management. The models, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, along with their earlier or smaller versions, were assigned roles as clinical chemists or laboratory managers to simulate real-world decision-making scenarios. Among the evaluated models, GPT-4o achieved the highest overall accuracy, correctly answering 81.7% of the quizzes, followed by GPT-4 Turbo (76.1%), Claude 3 Opus (74.3%), and Gemini 1.5 Pro (69.7%), while the lowest performance was observed with Gemini 1.0 Pro (51.4%). GPT-4o performed exceptionally well across all quiz types, including single-select, open-ended, and multiple-select questions, and demonstrated particular strength in quizzes involving figures, tables, or calculations. These findings highlight the ability of LLMs to effectively apply their pre-existing knowledge base to specialized clinical chemistry inquiries without additional fine-tuning. Among the evaluated models, GPT-4o exhibited superior performance across different quiz types, underscoring its potential utility in assisting healthcare professionals in clinical decision-making.
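The zero-shot protocol described above (a role prompt prepended to each quiz question, with accuracy computed over all 109 items) can be sketched as follows. This is a minimal illustration, not the authors' code: `build_prompt` and its wording are assumptions, and the model call itself is omitted.

```python
# Minimal sketch of the zero-shot evaluation protocol described above.
# The role-prompt wording is an assumption; the actual LLM call (which
# would return each model's answer) is left out of this illustration.

def build_prompt(question: str, role: str = "clinical chemist") -> str:
    """Prepend the assigned role to a quiz question (zero-shot: no worked examples)."""
    return f"You are a {role}. Answer the following quiz question.\n\n{question}"

def accuracy(graded: list[bool]) -> float:
    """Percentage of quizzes answered correctly."""
    return 100.0 * sum(graded) / len(graded) if graded else 0.0

# Example: 89 of 109 correct yields the 81.7% reported for GPT-4o.
print(round(accuracy([True] * 89 + [False] * 20), 1))
```

In the study, each of the nine models would be run through such a loop with the same prompts, so the reported percentages differ only in the models' answers, not in the scoring.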
About the journal:
The Scandinavian Journal of Clinical and Laboratory Investigation is an international scientific journal covering clinically oriented biochemical and physiological research. Since the launch of the journal in 1949, it has been a forum for international laboratory medicine, closely related to, and edited by, The Scandinavian Society for Clinical Chemistry.
The journal contains peer-reviewed articles, editorials, invited reviews, and short technical notes, as well as several supplements each year. Supplements consist of monographs, symposium reports, and congress reports covering subjects within clinical chemistry and clinical physiology.