A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone

IF 3.2 · Q1 OPHTHALMOLOGY · Ophthalmology Science · Pub Date: 2024-02-06 · DOI: 10.1016/j.xops.2024.100485
Prashant D. Tailor MD, Lauren A. Dalvin MD, John J. Chen MD, PhD, Raymond Iezzi MD, Timothy W. Olsen MD, Brittni A. Scruggs MD, PhD, Andrew J. Barkmeier MD, Sophie J. Bakri MD, Edwin H. Ryan MD, Peter H. Tang MD, PhD, D. Wilkin Parke III MD, Peter J. Belin MD, Jayanth Sridhar MD, David Xu MD, Ajay E. Kuriyan MD, Yoshihiro Yonekawa MD, Matthew R. Starr MD
{"title":"专家、专家编辑的大语言模型 (LLM) 或仅 LLM 对视网膜问题回答的比较研究","authors":"Prashant D. Tailor MD ,&nbsp;Lauren A. Dalvin MD ,&nbsp;John J. Chen MD, PhD ,&nbsp;Raymond Iezzi MD ,&nbsp;Timothy W. Olsen MD ,&nbsp;Brittni A. Scruggs MD, PhD ,&nbsp;Andrew J. Barkmeier MD ,&nbsp;Sophie J. Bakri MD ,&nbsp;Edwin H. Ryan MD ,&nbsp;Peter H. Tang MD, PhD ,&nbsp;D. Wilkin. Parke III MD ,&nbsp;Peter J. Belin MD ,&nbsp;Jayanth Sridhar MD ,&nbsp;David Xu MD ,&nbsp;Ajay E. Kuriyan MD ,&nbsp;Yoshihiro Yonekawa MD ,&nbsp;Matthew R. Starr MD","doi":"10.1016/j.xops.2024.100485","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>To assess the quality, empathy, and safety of expert edited large language model (LLM), human expert created, and LLM responses to common retina patient questions.</p></div><div><h3>Design</h3><p>Randomized, masked multicenter study.</p></div><div><h3>Participants</h3><p>Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.</p></div><div><h3>Methods</h3><p>Each expert created a response (Expert) and then edited a LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question along with anonymized and randomized Expert + AI, Expert, and LLM responses were evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).</p></div><div><h3>Main Outcome</h3><p>Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.</p></div><div><h3>Results</h3><p>There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (<em>P</em> &lt; 0.001, <em>P</em> &lt; 0.001) between LLM, Expert and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall while GPT-3.5 (3.75 ± 0.79) was the top performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (<em>P</em> &lt; 0.001) and empathy (<em>P</em> &lt; 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus expert-created response (<em>P</em> = 0.02). 
ChatGPT-4 performed similar to Expert for inappropriate content (<em>P</em> = 0.35), missing content (<em>P</em> = 0.001), extent of possible harm (<em>P</em> = 0.356), and likelihood of possible harm (<em>P</em> = 0.129).</p></div><div><h3>Conclusions</h3><p>In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.</p></div><div><h3>Financial Disclosure(s)</h3><p>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.</p></div>","PeriodicalId":74363,"journal":{"name":"Ophthalmology science","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666914524000216/pdfft?md5=5e2ad31ae6fa487a208c371f1e37de20&pid=1-s2.0-S2666914524000216-main.pdf","citationCount":"0","resultStr":"{\"title\":\"A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone\",\"authors\":\"Prashant D. Tailor MD ,&nbsp;Lauren A. Dalvin MD ,&nbsp;John J. Chen MD, PhD ,&nbsp;Raymond Iezzi MD ,&nbsp;Timothy W. Olsen MD ,&nbsp;Brittni A. Scruggs MD, PhD ,&nbsp;Andrew J. Barkmeier MD ,&nbsp;Sophie J. Bakri MD ,&nbsp;Edwin H. Ryan MD ,&nbsp;Peter H. Tang MD, PhD ,&nbsp;D. Wilkin. Parke III MD ,&nbsp;Peter J. Belin MD ,&nbsp;Jayanth Sridhar MD ,&nbsp;David Xu MD ,&nbsp;Ajay E. Kuriyan MD ,&nbsp;Yoshihiro Yonekawa MD ,&nbsp;Matthew R. Starr MD\",\"doi\":\"10.1016/j.xops.2024.100485\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><p>To assess the quality, empathy, and safety of expert edited large language model (LLM), human expert created, and LLM responses to common retina patient questions.</p></div><div><h3>Design</h3><p>Randomized, masked multicenter study.</p></div><div><h3>Participants</h3><p>Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.</p></div><div><h3>Methods</h3><p>Each expert created a response (Expert) and then edited a LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question along with anonymized and randomized Expert + AI, Expert, and LLM responses were evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).</p></div><div><h3>Main Outcome</h3><p>Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.</p></div><div><h3>Results</h3><p>There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (<em>P</em> &lt; 0.001, <em>P</em> &lt; 0.001) between LLM, Expert and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall while GPT-3.5 (3.75 ± 0.79) was the top performing LLM. 
For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (<em>P</em> &lt; 0.001) and empathy (<em>P</em> &lt; 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus expert-created response (<em>P</em> = 0.02). ChatGPT-4 performed similar to Expert for inappropriate content (<em>P</em> = 0.35), missing content (<em>P</em> = 0.001), extent of possible harm (<em>P</em> = 0.356), and likelihood of possible harm (<em>P</em> = 0.129).</p></div><div><h3>Conclusions</h3><p>In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.</p></div><div><h3>Financial Disclosure(s)</h3><p>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.</p></div>\",\"PeriodicalId\":74363,\"journal\":{\"name\":\"Ophthalmology science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-02-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666914524000216/pdfft?md5=5e2ad31ae6fa487a208c371f1e37de20&pid=1-s2.0-S2666914524000216-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmology science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666914524000216\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmology science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666914524000216","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone

Objective

To assess the quality, empathy, and safety of expert-edited large language model (LLM) responses, human expert-created responses, and LLM-only responses to common retina patient questions.

Design

Randomized, masked, multicenter study.

Participants

Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.

Methods

Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to the same question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the experts who had not written a response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).

Main Outcome

Mean quality and empathy scores and, for each response type, the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content.

Results

A total of 4008 grades were collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001 and P < 0.001) between the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed fourth of 7 for quality and sixth of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses outperformed expert-created responses. Editing an LLM response also took less time than creating an expert response from scratch (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129).
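
As a rough illustration of how mean ± SD figures like those above can be derived from the 5-point ratings described in the Methods, the sketch below maps the Likert labels to numeric scores and compares groups with a nonparametric test. This is a minimal sketch under assumed conventions: the 1-5 mapping, the example ratings, and the Kruskal-Wallis test are illustrative assumptions, not details taken from this abstract.

```python
# Illustrative sketch only (not the paper's analysis code): convert 5-point
# Likert labels to numeric scores, summarize each response type, and run a
# nonparametric group comparison. The label-to-number mapping, example data,
# and choice of Kruskal-Wallis test are assumptions for demonstration.
import numpy as np
from scipy.stats import kruskal

LIKERT = {"very poor": 1, "poor": 2, "acceptable": 3, "good": 4, "very good": 5}

# Hypothetical quality ratings for three response types.
ratings = {
    "Expert": ["good", "acceptable", "good", "very good", "acceptable"],
    "Expert + AI": ["very good", "good", "good", "very good", "good"],
    "GPT-3.5": ["good", "good", "acceptable", "very good", "good"],
}

# Encode labels as numbers for each group.
scores = {group: np.array([LIKERT[label] for label in labels])
          for group, labels in ratings.items()}

# Mean ± SD per group, in the same form reported in the Results.
for group, values in scores.items():
    print(f"{group}: {values.mean():.2f} ± {values.std(ddof=1):.2f}")

# Nonparametric comparison across the three groups (assumed test choice).
statistic, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {statistic:.2f}, P = {p_value:.3f}")
```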

Conclusions

In this randomized, masked, multicenter study, LLM responses were comparable with expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.

Financial Disclosure(s)

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.
