Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment.

World Journal of Urology | IF 2.9 | JCR Q2 (Urology & Nephrology) | CAS Region 2 (Medicine) | Pub Date: 2025-02-11 | DOI: 10.1007/s00345-025-05499-3
Mehmet Fatih Şahin, Çağrı Doğan, Erdem Can Topkaç, Serkan Şeramet, Furkan Batuhan Tuncer, Cenk Murat Yazıcı
World Journal of Urology, vol. 43, no. 1, p. 116. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11813998/pdf/. Citations: 0.

Abstract

Introduction: The European Board of Urology (EBU) In-Service Assessment (ISA) test evaluates urologists' knowledge and interpretation skills. Artificial Intelligence (AI) chatbots are widely used by physicians as a source of theoretical information. This study compares the test performance of five current chatbots on questions assessing knowledge and data interpretation.

Materials and methods: Five chatbots (GPT-4o, Copilot Pro, Gemini Advanced, Claude 3.5, and Sonar Huge) answered 596 questions from 6 exams administered between 2017 and 2022. The questions were divided into two categories: questions that measure knowledge and questions that require data interpretation. The chatbots' exam performances were then compared.

Results: Overall, all chatbots except Claude 3.5 passed the examinations, exceeding the 60% overall passing threshold. Copilot Pro scored highest and Claude 3.5 lowest; the difference was significant (71.6% vs. 56.2%, p = 0.001). When the 444 knowledge questions and 152 analysis questions were compared separately, Copilot Pro answered the highest percentage of knowledge questions correctly, whereas Claude 3.5 answered the lowest (72.1% vs. 57.4%, p = 0.001). The same held for the analysis questions (70.4% vs. 52.6%, p = 0.019).

Conclusions: Four of the five chatbots passed the EBU examinations with scores exceeding 60%; only one did not. Copilot Pro performed best on the EBU ISA examinations, whereas Claude 3.5 performed worst. All chatbots scored worse on analysis questions than on knowledge questions. Thus, although current chatbots perform well on theoretical knowledge, their competence in analyzing questions remains questionable.
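The head-to-head comparisons above are comparisons of proportions of correct answers. The abstract does not state which statistical test was used; as one common choice, a two-proportion z-test can be sketched on counts reconstructed from the reported percentages (the counts below are reconstructions, not figures taken from the paper):

```python
from math import sqrt, erf

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts reconstructed from the reported percentages:
# Copilot Pro ~72.1% vs. Claude 3.5 ~57.4% on the 444 knowledge questions.
z, p = two_proportion_z(round(0.721 * 444), 444, round(0.574 * 444), 444)
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```

On these reconstructed counts the difference is highly significant, consistent in direction with the p = 0.001 reported in the abstract, though the paper's exact test and p-value computation may differ.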

Source journal: World Journal of Urology (Medicine - Urology & Nephrology)
CiteScore: 6.80 | Self-citation rate: 8.80% | Annual articles: 317 | Review time: 4-8 weeks
Journal description: The WORLD JOURNAL OF UROLOGY regularly conveys the essential results of urological research and their practical and clinical relevance to a broad audience of urologists in research and clinical practice. To guarantee a balanced program, articles are published that reflect developments in all fields of urology at an internationally advanced level. Each issue treats a main topic in review articles by invited international experts. Free papers are articles unrelated to the main topic.