Kun-Peng Li, Li Wang, Shun Wan, Chen-Yang Wang, Si-Yu Chen, Shan-Hui Liu, Li Yang
Title: Enhanced Artificial Intelligence in Bladder Cancer Management: A Comparative Analysis and Optimization Study of Multiple Large Language Models
Journal: Journal of Endourology (impact factor 2.9; JCR Q1, Urology & Nephrology)
Publication date: 2025-03-18
Publication type: Journal Article
DOI: 10.1089/end.2024.0860 (https://doi.org/10.1089/end.2024.0860)
Citations: 0
Abstract
Background: With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations.
Methods: We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance.
Results: In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4.0 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo's accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy.
Conclusion: This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo's performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
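The accuracy figures in the Results are reported as a value ± a spread across three independent trials on 100 questions. The abstract does not say how the spread was computed, so the following is only a minimal sketch that assumes it is the sample standard deviation of per-trial accuracy. The function name `summarize_trials` and the per-trial counts are hypothetical, chosen for illustration; the abstract does not report per-trial data.

```python
from statistics import mean, stdev


def summarize_trials(correct_counts, n_questions=100):
    """Return (mean accuracy %, sample SD %) across independent trials.

    correct_counts: number of correctly answered questions in each trial.
    n_questions: size of the question set (100 in the abstract).
    Assumption: the reported "±" is the sample standard deviation (n - 1).
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-trial counts of correct answers (NOT the study's data).
hypothetical_counts = [88, 89, 91]

avg, sd = summarize_trials(hypothetical_counts)
print(f"{avg:.2f}% ± {sd:.2f}%")  # prints "89.33% ± 1.53%"
```

Under this assumption, three trials scoring 88, 89, and 91 out of 100 would produce a summary of the same form as the one reported for the top-ranked model. With only three trials the spread estimate is coarse, so small differences between models should be read cautiously.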
About the journal:
Journal of Endourology, JE Case Reports, and Videourology are the leading peer-reviewed journal, case reports publication, and innovative videojournal companion covering all aspects of minimally invasive urology research, applications, and clinical outcomes.
The leading journal of minimally invasive urology for over 30 years, Journal of Endourology is the essential publication for practicing surgeons who want to keep up with the latest surgical technologies in endoscopic, laparoscopic, robotic, and image-guided procedures as they apply to benign and malignant diseases of the genitourinary tract. This flagship journal includes the companion videojournal Videourology™ with every subscription. While Journal of Endourology remains focused on publishing rigorously peer-reviewed articles, Videourology accepts original videos containing material that has not been reported elsewhere, except in the form of an abstract or a conference presentation.
Journal of Endourology coverage includes:
The latest laparoscopic, robotic, endoscopic, and image-guided techniques for treating both benign and malignant conditions
Pioneering research articles
Controversial cases in endourology
Techniques in endourology with accompanying videos
Reviews and epochs in endourology
Endourology survey section covering endourology-relevant manuscripts published in other journals