Preliminary analysis of the impact of lab results on large language model generated differential diagnoses
Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Wenshan Han, Zhiyong Lu, Zhe He
npj Digital Medicine, published 2025-03-18. DOI: 10.1038/s41746-025-01556-8 (https://doi.org/10.1038/s41746-025-01556-8)
Abstract
Differential diagnosis (DDx) is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study evaluates how lab test results influence the accuracy of DDx generated by large language models (LLMs). Clinical vignettes incorporating demographics, symptoms, and lab data were created from 50 randomly selected case reports in PMC-Patients. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were prompted to generate Top 10, Top 5, and Top 1 DDx with and without lab data. Results show that incorporating lab data enhances accuracy by up to 30% across models. GPT-4 achieved the highest performance, with Top 1 accuracy of 55% (0.41–0.69) and lenient accuracy reaching 79% (0.68–0.90). Statistically significant improvements (Holm-adjusted p values < 0.05) were observed, with GPT-4 and Mixtral excelling. Lab tests, including liver function, metabolic/toxicology panels, and serology, were generally interpreted correctly by LLMs for DDx.
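The two quantities the abstract reports, Top-k accuracy and Holm-adjusted p values, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names `top_k_accuracy` and `holm_adjust` are hypothetical. Matching a prediction to the reference diagnosis by case-insensitive string equality is a simplifying assumption, since the abstract does not state how matches were judged.

```python
from typing import List, Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions.

    Assumption: a match is case-insensitive string equality after trimming.
    """
    if len(ranked) != len(gold) or not gold:
        raise ValueError("ranked and gold must be non-empty and the same length")
    hits = 0
    for preds, reference in zip(ranked, gold):
        top_k = {p.strip().lower() for p in preds[:k]}
        if reference.strip().lower() in top_k:
            hits += 1
    return hits / len(gold)


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjustment, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    predictions = [["sepsis", "hepatitis A"], ["gout", "cellulitis"]]
    references = ["Hepatitis A", "septic arthritis"]
    print(top_k_accuracy(predictions, references, k=2))
    print(holm_adjust([0.01, 0.04, 0.03]))
```

Under this sketch, a comparison counts as significant at the 0.05 level when its adjusted value is below 0.05, which mirrors the threshold quoted in the abstract.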
About the journal:
npj Digital Medicine is an online open-access journal that focuses on publishing peer-reviewed research in the field of digital medicine. The journal covers various aspects of digital medicine, including the application and implementation of digital and mobile technologies in clinical settings, virtual healthcare, and the use of artificial intelligence and informatics.
The primary goal of the journal is to support innovation and the advancement of healthcare through the integration of new digital and mobile technologies. When determining if a manuscript is suitable for publication, the journal considers four important criteria: novelty, clinical relevance, scientific rigor, and digital innovation.