Benchmarking LLM chatbots' oncological knowledge with the Turkish Society of Medical Oncology's annual board examination questions.

IF 3.4 · CAS Region 2 (Medicine) · Q2 ONCOLOGY · BMC Cancer 25(1):197 · Pub Date: 2025-02-04 · DOI: 10.1186/s12885-025-13596-0
Efe Cem Erdat, Engin Eren Kavak
Citations: 0

Abstract

Background: Large language models (LLMs) have shown promise in various medical applications, including clinical decision-making and education. In oncology, the increasing complexity of patient care and the vast volume of medical literature require efficient tools to assist practitioners. However, the use of LLMs in oncology education and knowledge assessment remains underexplored. This study aims to evaluate and compare the oncological knowledge of four LLMs using standardized board examination questions.

Methods: We assessed the performance of four LLMs, Claude 3.5 Sonnet (Anthropic), ChatGPT 4o (OpenAI), Llama-3 (Meta), and Gemini 1.5 (Google), using the Turkish Society of Medical Oncology's annual board examination questions from 2016 to 2024. A total of 790 valid multiple-choice questions covering various oncology topics were included. Each model was tested on its ability to answer these questions in Turkish. Performance was analyzed based on the number of correct answers, with statistical comparisons made using chi-square tests and one-way ANOVA.
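The statistical comparisons described above can be sketched as follows: a chi-square test on correct/incorrect counts per model, and a one-way ANOVA on per-exam scores. The counts and score lists below are illustrative placeholders, not the study's actual data; only the overall design (four models, 790 questions, eight annual exams) follows the abstract.

```python
# Sketch of the study's statistical comparisons, using hypothetical data.
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical correct/incorrect counts out of 790 questions per model.
counts = {
    "Claude 3.5 Sonnet": (613, 177),
    "ChatGPT 4o":        (536, 254),
    "Llama-3":           (380, 410),
    "Gemini 1.5":        (355, 435),
}
table = [list(v) for v in counts.values()]
chi2, p_chi, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_chi:.3g}")

# Hypothetical per-exam scores (%) across the eight annual exams, one list per model.
scores = [
    [80, 78, 76, 79, 77, 78, 80, 73],  # Claude 3.5 Sonnet
    [70, 68, 66, 69, 67, 70, 68, 64],  # ChatGPT 4o
    [50, 48, 46, 49, 47, 50, 48, 42],  # Llama-3
    [46, 44, 42, 45, 43, 46, 44, 38],  # Gemini 1.5
]
f_stat, p_anova = f_oneway(*scores)
print(f"F = {f_stat:.2f}, p = {p_anova:.3g}")
```

With between-model differences this large relative to the year-to-year variation, both tests report p well below 0.001, matching the direction of the study's reported result.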

Results: Claude 3.5 Sonnet outperformed the other models, passing all eight exams with an average score of 77.6%. ChatGPT 4o passed seven out of eight exams, with an average score of 67.8%. Llama-3 and Gemini 1.5 showed lower performance, passing four and three exams respectively, with average scores below 50%. Significant differences were observed among the models' performances (F = 17.39, p < 0.001). Claude 3.5 Sonnet and ChatGPT 4o demonstrated higher accuracy across most oncology topics. A decline in performance in recent years, particularly on the 2024 exam, suggests limitations due to outdated training data.

Conclusions: Significant differences in oncological knowledge were observed among the four LLMs, with Claude 3.5 Sonnet and ChatGPT 4o demonstrating superior performance. These findings suggest that advanced LLMs have the potential to serve as valuable tools in oncology education and decision support. However, regular updates and enhancements are necessary to maintain their relevance and accuracy, especially to incorporate the latest medical advancements.


Source journal: BMC Cancer (Medicine, Oncology)
CiteScore: 6.00 · Self-citation rate: 2.60% · Articles per year: 1204 · Review time: 6.8 months
Journal description: BMC Cancer is an open access, peer-reviewed journal that considers articles on all aspects of cancer research, including the pathophysiology, prevention, diagnosis and treatment of cancers. The journal welcomes submissions concerning molecular and cellular biology, genetics, epidemiology, and clinical trials.