Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

Journal of Biomedical Informatics (IF 4.5; Zone 2, Medicine; JCR Q2, Computer Science, Interdisciplinary Applications). Pub Date: 2024-03-01. DOI: 10.1016/j.jbi.2024.104620

Qiuhong Wei, Zhengxiong Yao, Ying Cui, Bo Wei, Zhezhen Jin, Ximing Xu

Journal of Biomedical Informatics, volume 151, Article 104620. Full text: https://www.sciencedirect.com/science/article/pii/S1532046424000388

Citation count: 0

Abstract

Objective

Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT’s performance in answering medical questions and provide direction for future research.

Methods

An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was “ChatGPT,” without restrictions on publication type, language, or date. Studies evaluating ChatGPT’s performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and the performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327.

Results

A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%–60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency.
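To make the pooled accuracy, confidence interval, and I² figures more concrete, the following is a minimal sketch of one common way to pool proportions: a DerSimonian-Laird random-effects model on raw proportions. The abstract does not state which pooling model or transformation the authors used, so this technique is an assumption, and the function name `pool_random_effects` and the study counts are hypothetical illustrations, not data from the review.

```python
import math


def pool_random_effects(studies):
    """Pool accuracies with a DerSimonian-Laird random-effects model.

    studies: list of (correct, total) pairs, one per study.
    Returns (pooled, ci_low, ci_high, i2), with I^2 as a fraction in [0, 1).
    Assumption: raw (untransformed) proportions; each study needs 0 < p < 1.
    """
    props = [correct / total for correct, total in studies]
    if any(p <= 0.0 or p >= 1.0 for p in props):
        raise ValueError("each study accuracy must lie strictly between 0 and 1")

    # Within-study variance of a binomial proportion.
    variances = [p * (1 - p) / total for p, (_, total) in zip(props, studies)]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / v for v in variances]
    fixed = sum(wi * p for wi, p in zip(w, props)) / sum(w)
    q = sum(wi * (p - fixed) ** 2 for wi, p in zip(w, props))
    df = len(studies) - 1

    # Between-study variance tau^2 (DerSimonian-Laird), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights and pooled estimate with a normal-approximation CI.
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * p for wi, p in zip(w_re, props)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, i2


# Hypothetical (correct, total) counts, for illustration only.
example_studies = [(45, 100), (120, 180), (30, 75), (88, 150)]
pooled, ci_low, ci_high, i2 = pool_random_effects(example_studies)
print(f"pooled={pooled:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f}), I2={i2:.1%}")
```

Under this kind of model, a high I² such as the reported 87% would indicate that most of the observed variation in accuracy reflects differences between studies rather than sampling error.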

Conclusion

This review reveals ChatGPT’s potential in addressing medical inquiries, but heterogeneity in study design and insufficient reporting might affect the reliability of the results. Our proposed evaluation framework provides insights for future study design and transparent reporting of LLMs in responding to medical questions.
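As a rough illustration of what transparent reporting could look like in practice, the sketch below checks a study record for the three methodological details the abstract names as frequently missing: date of inquiry, version of ChatGPT, and inter-rater consistency. It is not the authors' evaluation framework, whose full contents are not given in the abstract; the class `StudyReport`, its field names, and the function `missing_items` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class StudyReport:
    """Hypothetical record of reporting items named in the abstract."""
    inquiry_date: Optional[str] = None               # when ChatGPT was queried
    chatgpt_version: Optional[str] = None            # model or release used
    inter_rater_consistency: Optional[float] = None  # agreement statistic, if any


def missing_items(report: StudyReport) -> List[str]:
    """Return the names of reporting items left unset (None)."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Usage: a study that reports only the inquiry date.
report = StudyReport(inquiry_date="2023-06-15")
print(missing_items(report))
```

A checklist of this shape could be extended with further items once the full framework is consulted; the three fields here cover only what the abstract mentions.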



Source journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 8.90

Self-citation rate: 6.70%

Articles published: 243

Review time: 32 days

About the journal: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.