Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

Journal of Biomedical Informatics · IF 4.0 · JCR Q2 (Computer Science, Interdisciplinary Applications) · CAS Tier 2 (Medicine) · Pub Date: 2024-03-01 · DOI: 10.1016/j.jbi.2024.104620
Qiuhong Wei, Zhengxiong Yao, Ying Cui, Bo Wei, Zhezhen Jin, Ximing Xu
{"title":"Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis","authors":"Qiuhong Wei ,&nbsp;Zhengxiong Yao ,&nbsp;Ying Cui ,&nbsp;Bo Wei ,&nbsp;Zhezhen Jin ,&nbsp;Ximing Xu","doi":"10.1016/j.jbi.2024.104620","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT’s performance in answering medical questions and provide direction for future research.</p></div><div><h3>Methods</h3><p>An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was “ChatGPT,” without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data was extracted on general study characteristics, question sources, conversation processes, assessment metrics, and performance of ChatGPT. An evaluation framework for LLM in medical inquiries was proposed by integrating insights from selected literature. This study is registered with PROSPERO, CRD42023456327.</p></div><div><h3>Results</h3><p>A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the <em>meta</em>-analysis. ChatGPT displayed an overall integrated accuracy of 56 % (95 % CI: 51 %–60 %, I<sup>2</sup> = 87 %) in addressing medical queries. However, the studies varied in question resource, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency.</p></div><div><h3>Conclusion</h3><p>This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of the study design and insufficient reporting might affect the results’ reliability. Our proposed evaluation framework provides insights for the future study design and transparent reporting of LLM in responding to medical questions.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":null,"pages":null},"PeriodicalIF":4.0000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1532046424000388","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Objective

Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT’s performance in answering medical questions and provide direction for future research.

Methods

An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was “ChatGPT,” without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and the performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327.

Results

A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%–60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details such as the date of inquiry, the version of ChatGPT, and inter-rater consistency.
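The pooled figure above is the kind of estimate a random-effects meta-analysis of proportions produces. The paper does not publish its analysis code, so the following is only a minimal sketch, assuming logit-transformed proportions pooled with a DerSimonian–Laird estimator; the study counts in `studies` are hypothetical, not data from the review.

```python
# Illustrative random-effects meta-analysis of proportions.
# NOTE: the study data below are hypothetical, and the review's actual
# transformation and estimator may differ from this sketch.
import math

# (correct responses, total questions) per hypothetical study
studies = [(45, 80), (120, 200), (60, 115), (30, 50), (75, 140)]

# Logit-transform each accuracy; var of logit(p) is approx. 1/x + 1/(n - x)
effects, variances = [], []
for x, n in studies:
    p = x / n
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / x + 1 / (n - x))

# Fixed-effect (inverse-variance) pooling and Cochran's Q
w = [1 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
df = len(studies) - 1

# DerSimonian-Laird between-study variance tau^2 and the I^2 statistic
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

# Random-effects pooled estimate on the logit scale, then back-transform
w_re = [1 / (v + tau2) for v in variances]
pooled_logit = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
lo, hi = pooled_logit - 1.96 * se, pooled_logit + 1.96 * se

def expit(z):
    return 1 / (1 + math.exp(-z))

print(f"Pooled accuracy: {expit(pooled_logit):.1%} "
      f"(95% CI: {expit(lo):.1%}-{expit(hi):.1%}), I^2 = {i2:.0%}")
```

The logit transformation keeps the pooled proportion and its confidence bounds inside (0, 1). An I² as high as the review's 87% indicates that most of the observed between-study variation reflects genuine heterogeneity rather than sampling error, which is why a random-effects model is the natural choice here.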

Conclusion

This review reveals ChatGPT's potential in addressing medical inquiries, but heterogeneity in study design and insufficient reporting might affect the reliability of the results. Our proposed evaluation framework offers guidance for the design and transparent reporting of future studies of LLMs in responding to medical questions.


Source Journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)
CiteScore: 8.90
Self-citation rate: 6.70%
Articles per year: 243
Review time: 32 days
Journal Introduction:

The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics.

Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images.

System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.
Latest Articles in This Journal

FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs
Clinical outcome-guided deep temporal clustering for disease progression subtyping
Cross-domain visual prompting with spatial proximity knowledge distillation for histological image classification
Advancing cancer driver gene identification through an integrative network and pathway approach
Proxy endpoints — bridging clinical trials and real world data