Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images

IF 3.0 · Zone 3, Medicine · Q2 ENDOCRINOLOGY & METABOLISM · Endocrine · Pub Date: 2024-10-11 · DOI: 10.1007/s12020-024-04066-x
Ziman Chen, Nonhlanhla Chambara, Chaoqun Wu, Xina Lo, Shirley Yuk Wah Liu, Simon Takadiyi Gunda, Xinyang Han, Jingguo Qu, Fei Chen, Michael Tin Cheung Ying
Citations: 0

Abstract

Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.

Purpose: Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.

Methods: This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to diagnose the benign or malignant nature of the nodules. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, referencing pathological diagnoses.

Results: ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
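The agreement and discrimination figures above can be made concrete with a small sketch. The abstract does not report the underlying 2×2 tables, so the counts used in the test are hypothetical; they were chosen only to sum to 41 malignant and 75 benign nodules. Two further assumptions are made here: the AUC is computed for a single benign/malignant call per nodule (where it reduces to the mean of sensitivity and specificity), and the "unnecessary biopsy rate" is read as benign nodules called malignant divided by all nodules. Neither definition is confirmed by the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionTable:
    """2x2 table of reader calls against pathology (hypothetical helper type)."""
    tp: int  # malignant on pathology, called malignant
    fn: int  # malignant on pathology, called benign
    fp: int  # benign on pathology, called malignant
    tn: int  # benign on pathology, called benign

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def cohen_kappa(t: ConfusionTable) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = t.total
    observed = (t.tp + t.tn) / n
    # Chance agreement from the marginal totals of pathology and reader calls.
    expected = (
        (t.tp + t.fn) * (t.tp + t.fp) + (t.fp + t.tn) * (t.fn + t.tn)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def binary_auc(t: ConfusionTable) -> float:
    """AUC of a single binary call: mean of sensitivity and specificity."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return (sensitivity + specificity) / 2


def false_positive_share(t: ConfusionTable) -> float:
    """Assumed reading of 'unnecessary biopsy rate': false positives / all nodules."""
    return t.fp / t.total


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the paper.
    example = ConfusionTable(tp=32, fn=9, fp=48, tn=27)
    print(round(cohen_kappa(example), 3))
    print(round(binary_auc(example), 3))
    print(round(false_positive_share(example), 3))
```

A kappa near zero together with an AUC near 0.5 means the calls are barely better than chance, which is how the abstract characterizes both LLMs relative to the junior radiologist.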

Conclusion: While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.

Source journal
Endocrine (ENDOCRINOLOGY & METABOLISM)
CiteScore: 6.50
Self-citation rate: 5.40%
Number of articles published: 295
Review time: 1.5 months
About the journal: Well-established as a major journal in today's rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles will be accepted based on peer review, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology. Endocrine covers controversial clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor related or unrelated to recently published articles can be submitted. Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.
Latest articles from this journal
Correction to: Therapeutic patient education and treatment intensification of diabetes and hypertension in subjects with newly diagnosed type 2 diabetes mellitus: a longitudinal study.
Correction: Timing of the repeat thyroid fine-needle aspiration biopsy: does early repeat biopsy change the rate of nondiagnostic or atypia of undetermined significance cytology result?
Hematological toxicities with Lutathera® for neuroendocrine neoplasms: post-marketing surveillance data from the US-FDA.
SGLT2 inhibitors may reduce non-small cell lung cancer and not increase various neoplasms including several skin cancers.
Clarification on the role of thyroid scintigraphy in the era of TIRADS: a response to Trimboli et al. (2024).