The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study.

Expert Review of Gastroenterology & Hepatology (IF 2.5; Zone 3, Medicine; JCR Q2, Gastroenterology & Hepatology). Pub Date: 2025-04-01. Epub Date: 2025-02-27. DOI: 10.1080/17474124.2025.2471874

Madunil A Niriella, Pathum Premaratna, Mananjala Senanayake, Senerath Kodisinghe, Uditha Dassanayake, Anuradha Dassanayake, Dileepa S Ediriweera, H Janaka de Silva

Pages: 437-442. Publication type: Journal Article.

Citations: 0

Abstract

Background: We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information.

Research design and methods: We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response.
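As an illustration of the test named above, the following is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with the standard tie correction. It is not the authors' analysis code. The function names and the rating values are hypothetical, and the closed-form p-value used here holds only for two degrees of freedom, which is the case when three groups are compared.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, df) for a list of groups, with tie correction applied."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank total squared / group size) over the groups.
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


def p_value_df2(h):
    """Chi-square upper-tail probability; valid only for df = 2."""
    return math.exp(-h / 2.0)


# Hypothetical ratings for three response sources (not data from the study).
source_a = [5, 4, 5, 4, 5]
source_b = [4, 4, 5, 3, 5]
source_c = [5, 5, 4, 4, 4]

h, df = kruskal_wallis([source_a, source_b, source_c])
print(f"H({df}) = {h:.3f}, p = {p_value_df2(h):.3f}")
```

Because the test works on ranks rather than raw scores, it suits ordinal rating scales, where heavy ties are common and the tie correction matters.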

Results: The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We found no statistical difference between rank totals in accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] between the three raters (R1, R2, R3).
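The reported statistics can be checked for internal consistency. Under the usual chi-square approximation, a Kruskal-Wallis H with 2 degrees of freedom has the closed-form p-value p = exp(-H/2). The short sketch below recomputes p from each reported H; the labels and function name are chosen here for illustration, and the approximation is an assumption about how the p-values were obtained. The recomputed values agree with the reported ones to within about 0.001.

```python
import math


def chi2_sf_df2(h):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-h / 2.0)


# (H, reported p) pairs copied from the Results paragraph.
reported = {
    "groups / accuracy": (0.421, 0.811),
    "groups / completeness": (3.146, 0.207),
    "groups / quality": (3.350, 0.187),
    "raters / accuracy": (5.559, 0.062),
    "raters / completeness": (0.104, 0.949),
    "raters / quality": (0.420, 0.810),
}

for label, (h, p_reported) in reported.items():
    p = chi2_sf_df2(h)
    print(f"{label}: H(2) = {h:.3f}, reported p = {p_reported:.3f}, recomputed p = {p:.3f}")
```

None of the six p-values falls below 0.05, which is the basis for the "no statistical difference" wording; note that a non-significant result does not by itself demonstrate equivalence between the groups.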

Conclusion: Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.


Source journal

Expert Review of Gastroenterology & Hepatology (Gastroenterology & Hepatology)

CiteScore: 6.80

Self-citation rate: 2.60%

Review time: 6-12 weeks

Journal description: The enormous health and economic burden of gastrointestinal disease worldwide warrants a sharp focus on the etiology, epidemiology, prevention, diagnosis, treatment and development of new therapies. By the end of the last century we had seen enormous advances, both in technologies to visualize disease and in curative therapies in areas such as gastric ulcer, with the advent first of the H2-antagonists and then the proton pump inhibitors - clear examples of how advances in medicine can massively benefit the patient. Nevertheless, specialists face ongoing challenges from a wide array of diseases of diverse etiology.