{"title":"Comparative performance of artificial intelligence models in rheumatology board-level questions: evaluating Google Gemini and ChatGPT-4o.","authors":"Enes Efe Is, Ahmet Kivanc Menekseoglu","doi":"10.1007/s10067-024-07154-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>This study evaluates the performance of AI models, ChatGPT-4o and Google Gemini, in answering rheumatology board-level questions, comparing their effectiveness, reliability, and applicability in clinical practice.</p><p><strong>Method: </strong>A cross-sectional study was conducted using 420 rheumatology questions from the BoardVitals question bank, excluding 27 visual data questions. Both artificial intelligence models categorized the questions according to difficulty (easy, medium, hard) and answered them. In addition, the reliability of the answers was assessed by asking the questions a second time. The accuracy, reliability, and difficulty categorization of the AI models' response to the questions were analyzed.</p><p><strong>Results: </strong>ChatGPT-4o answered 86.9% of the questions correctly, significantly outperforming Google Gemini's 60.2% accuracy (p < 0.001). When the questions were asked a second time, the success rate was 86.7% for ChatGPT-4o and 60.5% for Google Gemini. Both models mainly categorized questions as medium difficulty. ChatGPT-4o showed higher accuracy in various rheumatology subfields, notably in Basic and Clinical Science (p = 0.028), Osteoarthritis (p = 0.023), and Rheumatoid Arthritis (p < 0.001).</p><p><strong>Conclusions: </strong>ChatGPT-4o significantly outperformed Google Gemini in rheumatology board-level questions. This demonstrates the success of ChatGPT-4o in situations requiring complex and specialized knowledge related to rheumatological diseases. The performance of both AI models decreased as the question difficulty increased. This study demonstrates the potential of AI in clinical applications and suggests that its use as a tool to assist clinicians may improve healthcare efficiency in the future. Future studies using real clinical scenarios and real board questions are recommended. Key Points •ChatGPT-4o significantly outperformed Google Gemini in answering rheumatology board-level questions, achieving 86.9% accuracy compared to Google Gemini's 60.2%. •For both AI models, the correct answer rate decreased as the question difficulty increased. •The study demonstrates the potential for AI models to be used in clinical practice as a tool to assist clinicians and improve healthcare efficiency.</p>","PeriodicalId":10482,"journal":{"name":"Clinical Rheumatology","volume":null,"pages":null},"PeriodicalIF":2.9000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Rheumatology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s10067-024-07154-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/28 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Objectives: This study evaluates the performance of two AI models, ChatGPT-4o and Google Gemini, in answering rheumatology board-level questions, comparing their effectiveness, reliability, and applicability in clinical practice.
Method: A cross-sectional study was conducted using 420 text-based rheumatology questions from the BoardVitals question bank, selected after excluding 27 questions containing visual data. Both artificial intelligence models categorized the questions by difficulty (easy, medium, or hard) and then answered them. The reliability of the answers was assessed by posing each question a second time. The accuracy, reliability, and difficulty categorization of the AI models' responses were analyzed.
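To make the protocol concrete, here is a minimal sketch of the evaluation loop described above: each model answers every question twice, and overall accuracy plus test-retest agreement are computed. The `Question` type and `ask_model` stub are hypothetical illustrations, not the authors' code; however the models were actually queried (chat interface or API) is abstracted behind the stub.

```python
# Minimal sketch of the evaluation protocol: each model answers every question
# twice; we compute accuracy (first pass) and test-retest agreement.
# NOTE: `Question` and `ask_model` are hypothetical stand-ins, not study code.

from dataclasses import dataclass

@dataclass
class Question:
    stem: str
    choices: dict[str, str]   # e.g. {"A": "...", "B": "...", ...}
    correct: str              # correct choice letter

def ask_model(model: str, q: Question) -> str:
    """Hypothetical stub: return the choice letter the model selects."""
    raise NotImplementedError

def evaluate(model: str, questions: list[Question]) -> tuple[float, float]:
    first  = [ask_model(model, q) for q in questions]
    second = [ask_model(model, q) for q in questions]   # second pass for reliability
    accuracy  = sum(a == q.correct for a, q in zip(first, questions)) / len(questions)
    agreement = sum(a == b for a, b in zip(first, second)) / len(questions)
    return accuracy, agreement
```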
Results: ChatGPT-4o answered 86.9% of the questions correctly, significantly outperforming Google Gemini's 60.2% accuracy (p < 0.001). When the questions were asked a second time, the success rate was 86.7% for ChatGPT-4o and 60.5% for Google Gemini. Both models mainly categorized questions as medium difficulty. ChatGPT-4o showed higher accuracy in various rheumatology subfields, notably in Basic and Clinical Science (p = 0.028), Osteoarthritis (p = 0.023), and Rheumatoid Arthritis (p < 0.001).
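The abstract reports a between-model difference with p < 0.001 but does not name the statistical test. As a hedged illustration, the sketch below applies a chi-square test to a 2×2 contingency table built from counts reconstructed from the reported percentages; this is one common choice for comparing two proportions, not necessarily the authors' analysis (a paired test such as McNemar's would also be defensible, since both models answered the same questions).

```python
# One standard way to compare two accuracy rates (the abstract does not state
# which test the authors used). Counts are reconstructed from the reported
# percentages over the 420 scored questions, so they are approximate.
from scipy.stats import chi2_contingency

n = 420                             # questions answered by both models
gpt_correct    = round(0.869 * n)   # ≈ 365
gemini_correct = round(0.602 * n)   # ≈ 253

table = [
    [gpt_correct,    n - gpt_correct],     # ChatGPT-4o: correct / incorrect
    [gemini_correct, n - gemini_correct],  # Google Gemini: correct / incorrect
]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")   # p << 0.001, consistent with the abstract
```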
Conclusions: ChatGPT-4o significantly outperformed Google Gemini on rheumatology board-level questions, demonstrating its strength in tasks that require complex, specialized knowledge of rheumatological diseases. The performance of both AI models decreased as question difficulty increased. This study demonstrates the potential of AI in clinical applications and suggests that its use as a tool to assist clinicians may improve healthcare efficiency in the future. Future studies using real clinical scenarios and real board questions are recommended.
Key Points
• ChatGPT-4o significantly outperformed Google Gemini in answering rheumatology board-level questions, achieving 86.9% accuracy compared with Google Gemini's 60.2%.
• For both AI models, the correct answer rate decreased as question difficulty increased.
• The study demonstrates the potential for AI models to be used in clinical practice as a tool to assist clinicians and improve healthcare efficiency.
About the journal
Clinical Rheumatology is an international English-language journal devoted to publishing original clinical investigation and research in the general field of rheumatology, with an emphasis on clinical aspects at the postgraduate level.
The journal succeeds Acta Rheumatologica Belgica, founded in 1945 as the official journal of the Belgian Rheumatology Society. Clinical Rheumatology aims to cover all modern trends in clinical and experimental research, as well as the management and evaluation of diagnostic and treatment procedures, connected with inflammatory, immunologic, metabolic, genetic, and degenerative diseases of soft and hard connective tissue.