Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.

American Journal of Gastroenterology · Impact Factor 8.0 · Q1, Gastroenterology & Hepatology · CAS Medicine, Tier 1
Pub Date: 2024-12-17 · DOI: 10.14309/ajg.0000000000003255
Sheza Malik, Lewis J Frey, Jason Gutman, Asim Mushtaq, Fatima Warraich, Kamran Qureshi

Abstract

Introduction: Recent advancements in Artificial Intelligence (AI), particularly the deployment of Large Language Models (LLMs), have profoundly impacted healthcare. This study assesses five LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We then compare these results with ChatGPT 4 enhanced with Retrieval-Augmented Generation (RAG).

Methods: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions and clinical scenarios to probe the LLMs' ability to handle diverse clinical questions. Using each model's "New Chat" functionality, every query was processed in a fresh session to reduce carry-over bias. Additionally, we employed the RAG functionality of GPT-4, which grounds responses in external reference sources. All responses were rated on a 1-to-5 Likert scale for accuracy, clarity, and relevance by four independent investigators to ensure impartiality.
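The grounding step described above can be sketched in code. The following is a minimal, illustrative retrieval-augmented generation loop; the retrieval here is naive keyword overlap purely for illustration, and all function names are hypothetical — the study itself used GPT-4's built-in RAG feature with external reference documents, not this implementation.

```python
# Hypothetical RAG sketch: rank guideline passages against a clinical
# query, then build a grounded prompt for the model. Retrieval by word
# overlap is a stand-in for whatever retriever GPT-4's RAG feature uses.

def retrieve(query, corpus, k=2):
    """Return the k corpus passages with the greatest word overlap with the query."""
    qwords = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(qwords & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query, passages):
    """Instruct the model to answer only from the retrieved guideline text."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these guideline excerpts:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    corpus = [
        "Give N-acetylcysteine promptly for acetaminophen overdose",
        "King's College criteria guide transplant referral",
    ]
    query = "acetaminophen overdose management"
    print(build_prompt(query, retrieve(query, corpus, k=1)))
```

In a real pipeline the grounded prompt would then be sent to the LLM; constraining the answer to retrieved excerpts is what lets RAG supply accurate, up-to-date information.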

Results: ChatGPT 4 augmented with RAG demonstrated superior performance, consistently scoring the highest (4.70, 4.89, 4.78) across all three domains. ChatGPT 4 alone exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. CLAUDE achieved 3.65 in accuracy, 3.04 in clarity, and 3.6 in relevance. BARD and COPILOT performed lower: BARD scored 2.01 in accuracy and 3.03 in relevance, while COPILOT scored 2.26 in accuracy and 3.12 in relevance.
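Scores like those above are per-domain means over the independent raters' 1-to-5 Likert ratings. A minimal sketch of that aggregation, with entirely illustrative data (not the study's ratings), might look like:

```python
# Hypothetical aggregation of Likert ratings: each row is one
# (rater, question) rating of a model across the three domains;
# a model's domain score is the mean across all rows.
from statistics import mean

DOMAINS = ("accuracy", "clarity", "relevance")

def aggregate(ratings):
    """ratings: {model: [ {domain: 1-5 score} per (rater, question) ]}.
    Returns {model: {domain: mean score, rounded to 2 decimals}}."""
    return {
        model: {d: round(mean(row[d] for row in rows), 2) for d in DOMAINS}
        for model, rows in ratings.items()
    }

if __name__ == "__main__":
    demo = {
        "model A": [
            {"accuracy": 5, "clarity": 5, "relevance": 5},
            {"accuracy": 4, "clarity": 5, "relevance": 4},
        ],
    }
    print(aggregate(demo))
```

With four raters and 16 questions, each model's domain mean would be taken over 64 such rows.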

Conclusion: The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. Integrating RAG with an LLM combines generative language ability with accurate, up-to-date information, improving response clarity, relevance, and accuracy and making such systems more effective in healthcare. However, AI models must continually evolve and stay aligned with medical practice for successful healthcare integration.

Source journal: American Journal of Gastroenterology (Medicine / Gastroenterology & Hepatology)
CiteScore: 11.40 · Self-citation rate: 5.10% · Annual articles: 458 · Review time: 12 months

About the journal: Published on behalf of the American College of Gastroenterology (ACG), The American Journal of Gastroenterology (AJG) stands as the foremost clinical journal in the fields of gastroenterology and hepatology. AJG offers practical and professional support to clinicians addressing the most prevalent gastroenterological disorders in patients.