The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study.

Journal of Thoracic Imaging | IF 2.0 | Region 4 (Medicine) | JCR Q3: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING | Pub Date: 2024-09-13 | DOI: 10.1097/RTI.0000000000000805
Yasin Celal Gunes, Turay Cesur
Citations: 0

Abstract

Purpose: To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by The Society of Thoracic Radiology.

Materials and methods: We collected 124 publicly available "Case of the Month" cases published on the Society of Thoracic Radiology website between March 2012 and December 2023. Medical history and imaging findings were input into the LLMs for diagnosis and differential diagnosis, while the radiologists independently provided visual assessments. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.

Results: Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4/Google Gemini 1.5 Pro (59.75%), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), outperforming radiologists (52.4% and 41.1%) and other LLMs (P<0.05). The DDxScore of Claude 3 Opus was significantly better than that of other LLMs and radiologists, except ChatGPT 3.5 (P<0.05). All LLMs and radiologists showed greater accuracy in specific cases (P<0.05), with no DDxScore difference for Perplexity and Google Bard based on specificity (P>0.05). There were no significant differences between LLMs and radiologists in the diagnostic accuracy of anatomic subgroups (P>0.05), except for Meta Llama 3 70b in the vascular cases (P=0.040).

Conclusions: Claude 3 Opus outperformed other LLMs and radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision systems under proper medical supervision.
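
The methods name the McNemar test among the statistics used to compare diagnostic accuracy. As a rough illustration of how such a paired comparison works, the sketch below computes a two-sided exact McNemar test in plain Python. The function name `mcnemar_exact` and the toy data are hypothetical and are not taken from the study; the abstract does not state which variant of the test (exact or chi-square approximation) the authors used.

```python
from math import comb


def mcnemar_exact(reader_a, reader_b):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Each input is a sequence of 1 (correct) / 0 (incorrect), one entry per
    case, with both readers evaluated on the same cases in the same order.
    """
    if len(reader_a) != len(reader_b):
        raise ValueError("paired inputs must have equal length")
    # Discordant pairs: cases one reader got right and the other got wrong.
    b = sum(1 for x, y in zip(reader_a, reader_b) if x and not y)
    c = sum(1 for x, y in zip(reader_a, reader_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair favors either reader
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data (NOT from the study): 1 = correct, 0 = incorrect.
model = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
radiologist = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
b, c, p = mcnemar_exact(model, radiologist)
print(f"discordant pairs: b={b}, c={c}, exact p={p:.3f}")
```

Only the discordant cases carry information in this test: cases that both readers diagnosed correctly, or both missed, do not affect the p-value.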

Source journal: Journal of Thoracic Imaging (Medicine: Nuclear Medicine)
CiteScore: 7.10
Self-citation rate: 9.10%
Articles published: 87
Review time: 6-12 weeks
About the journal: Journal of Thoracic Imaging (JTI) provides authoritative information on all aspects of the use of imaging techniques in the diagnosis of cardiac and pulmonary diseases. Original articles and analytical reviews published in the journal present the latest thinking of leading experts on the use of chest radiography, computed tomography, magnetic resonance imaging, positron emission tomography, ultrasound, and other promising imaging techniques in cardiopulmonary radiology. It is the official journal of the Society of Thoracic Radiology, the Japanese Society of Thoracic Radiology, the Korean Society of Thoracic Radiology, and the European Society of Thoracic Imaging.