Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models.

Journal of Endourology; IF 2.9; Zone 2 (Medicine); JCR Q1 (Urology & Nephrology); Pub Date: 2025-03-18; DOI: 10.1089/end.2024.0860

Kun-Peng Li, Li Wang, Shun Wan, Chen-Yang Wang, Si-Yu Chen, Shan-Hui Liu, Li Yang

Citations: 0

Abstract

Background: With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations.

Methods: We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance.

Results: In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy.

Conclusion: This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
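The accuracy figures in the Results are reported as a mean with a ± spread over three independent trials on 100 questions. The abstract does not say how the spread was computed, so the following is only a minimal sketch, assuming it is the sample standard deviation across trials. The function name `summarize_trials` and the per-trial counts are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def summarize_trials(correct_counts, n_questions=100):
    """Return (mean accuracy %, sample SD %) across independent trials.

    Assumption: the reported "±" is a sample standard deviation (n - 1
    denominator). The abstract does not confirm this.
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    return mean(accuracies), stdev(accuracies)


# Hypothetical numbers of correct answers in three trials (illustrative only,
# NOT taken from the paper).
hypothetical_counts = [88, 89, 91]

m, sd = summarize_trials(hypothetical_counts)
print(f"{m:.2f}% ± {sd:.2f}%")  # prints "89.33% ± 1.53%"
```

With only three trials the spread estimate is coarse, so small differences between models with overlapping ± ranges should be read with caution.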

Source journal

Journal of Endourology (Medicine: Urology & Nephrology)

CiteScore: 5.50

Self-citation rate: 14.80%

Articles published: 254

Review time: 1 month

About the journal: Journal of Endourology, JE Case Reports, and Videourology are the leading peer-reviewed journal, case reports publication, and innovative videojournal companion covering all aspects of minimally invasive urology research, applications, and clinical outcomes. The leading journal of minimally invasive urology for over 30 years, Journal of Endourology is the essential publication for practicing surgeons who want to keep up with the latest surgical technologies in endoscopic, laparoscopic, robotic, and image-guided procedures as they apply to benign and malignant diseases of the genitourinary tract. This flagship journal includes the companion videojournal Videourology™ with every subscription. While Journal of Endourology remains focused on publishing rigorously peer-reviewed articles, Videourology accepts original videos containing material that has not been reported elsewhere, except in the form of an abstract or a conference presentation. Journal of Endourology coverage includes: the latest laparoscopic, robotic, endoscopic, and image-guided techniques for treating both benign and malignant conditions; pioneering research articles; controversial cases in endourology; techniques in endourology with accompanying videos; reviews and epochs in endourology; and an endourology survey section of endourology-relevant manuscripts published in other journals.