Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance

Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi

American Journal of Gastroenterology, published 2024-12-17. DOI: 10.14309/ajg.0000000000003255
Citations: 0
Abstract
Introduction: Recent advancements in Artificial Intelligence (AI), particularly the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT 4 enhanced with Retrieval Augmented Generation (RAG) technology.
Methods: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore the LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed in a fresh session for each model to reduce carry-over bias between questions. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were rated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by four independent investigators to ensure impartiality.
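The scoring procedure described above (four independent raters, 1-5 Likert scale, three domains) reduces to averaging each domain's ratings across raters. A minimal sketch of that aggregation, using hypothetical ratings (the study's per-rater data are not published in the abstract):

```python
from statistics import mean

def aggregate_scores(ratings):
    """Average Likert ratings (1-5) from several independent raters per domain."""
    return {domain: round(mean(scores), 2) for domain, scores in ratings.items()}

# Hypothetical ratings from four investigators for one model's responses
ratings = {
    "accuracy":  [4, 5, 4, 4],
    "clarity":   [5, 5, 4, 5],
    "relevance": [4, 4, 5, 4],
}
print(aggregate_scores(ratings))
# → {'accuracy': 4.25, 'clarity': 4.75, 'relevance': 4.25}
```

Per-domain means like these are what the Results section reports for each model.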
Results: ChatGPT 4 augmented with RAG demonstrated superior performance, consistently scoring the highest across all three domains (accuracy 4.70, clarity 4.89, relevance 4.78). ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.65 in accuracy, 3.04 in clarity, and 3.6 in relevance. Meanwhile, BARD and COPILOT exhibited lower performance: BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.
Conclusion: The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. By integrating RAG with an LLM, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making these models more effective in healthcare. However, AI models must continually evolve and align with medical practice for successful healthcare integration.
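The RAG pattern credited above for the best scores (retrieve relevant external passages, then ground the model's answer in them) can be sketched in miniature. The guideline snippets, retrieval method (toy word-overlap ranking), and prompt template below are all illustrative assumptions, not the study's actual implementation, which used GPT-4's built-in RAG functionality:

```python
import re

def tokens(text):
    """Lowercased word set for simple overlap scoring."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (toy retriever)."""
    ranked = sorted(corpus, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus):
    """Prepend retrieved guideline text so the model answers from it."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

# Hypothetical guideline snippets standing in for the external sources
corpus = [
    "N-acetylcysteine is recommended for acetaminophen-induced acute liver failure.",
    "Liver transplantation should be considered for patients meeting King's College criteria.",
    "Colonoscopy surveillance intervals depend on polyp histology.",
]
prompt = build_prompt("What is the treatment for acetaminophen acute liver failure?", corpus)
print(prompt)
```

In a full system, `prompt` would be sent to the LLM; only the grounded context assembly is shown here, since that is the step RAG adds over a plain query.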
About the journal:
Published on behalf of the American College of Gastroenterology (ACG), The American Journal of Gastroenterology (AJG) stands as the foremost clinical journal in the fields of gastroenterology and hepatology. AJG offers practical and professional support to clinicians addressing the most prevalent gastroenterological disorders in patients.