A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone

IF 3.2 · Q1 OPHTHALMOLOGY · Ophthalmology Science · Pub Date: 2024-02-06 · DOI: 10.1016/j.xops.2024.100485
Prashant D. Tailor MD, Lauren A. Dalvin MD, John J. Chen MD, PhD, Raymond Iezzi MD, Timothy W. Olsen MD, Brittni A. Scruggs MD, PhD, Andrew J. Barkmeier MD, Sophie J. Bakri MD, Edwin H. Ryan MD, Peter H. Tang MD, PhD, D. Wilkin Parke III MD, Peter J. Belin MD, Jayanth Sridhar MD, David Xu MD, Ajay E. Kuriyan MD, Yoshihiro Yonekawa MD, Matthew R. Starr MD
{"title":"专家、专家编辑的大语言模型 (LLM) 或仅 LLM 对视网膜问题回答的比较研究","authors":"Prashant D. Tailor MD ,&nbsp;Lauren A. Dalvin MD ,&nbsp;John J. Chen MD, PhD ,&nbsp;Raymond Iezzi MD ,&nbsp;Timothy W. Olsen MD ,&nbsp;Brittni A. Scruggs MD, PhD ,&nbsp;Andrew J. Barkmeier MD ,&nbsp;Sophie J. Bakri MD ,&nbsp;Edwin H. Ryan MD ,&nbsp;Peter H. Tang MD, PhD ,&nbsp;D. Wilkin. Parke III MD ,&nbsp;Peter J. Belin MD ,&nbsp;Jayanth Sridhar MD ,&nbsp;David Xu MD ,&nbsp;Ajay E. Kuriyan MD ,&nbsp;Yoshihiro Yonekawa MD ,&nbsp;Matthew R. Starr MD","doi":"10.1016/j.xops.2024.100485","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>To assess the quality, empathy, and safety of expert edited large language model (LLM), human expert created, and LLM responses to common retina patient questions.</p></div><div><h3>Design</h3><p>Randomized, masked multicenter study.</p></div><div><h3>Participants</h3><p>Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.</p></div><div><h3>Methods</h3><p>Each expert created a response (Expert) and then edited a LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question along with anonymized and randomized Expert + AI, Expert, and LLM responses were evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).</p></div><div><h3>Main Outcome</h3><p>Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.</p></div><div><h3>Results</h3><p>There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (<em>P</em> &lt; 0.001, <em>P</em> &lt; 0.001) between LLM, Expert and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall while GPT-3.5 (3.75 ± 0.79) was the top performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (<em>P</em> &lt; 0.001) and empathy (<em>P</em> &lt; 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus expert-created response (<em>P</em> = 0.02). 
ChatGPT-4 performed similar to Expert for inappropriate content (<em>P</em> = 0.35), missing content (<em>P</em> = 0.001), extent of possible harm (<em>P</em> = 0.356), and likelihood of possible harm (<em>P</em> = 0.129).</p></div><div><h3>Conclusions</h3><p>In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.</p></div><div><h3>Financial Disclosure(s)</h3><p>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.</p></div>","PeriodicalId":74363,"journal":{"name":"Ophthalmology science","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666914524000216/pdfft?md5=5e2ad31ae6fa487a208c371f1e37de20&pid=1-s2.0-S2666914524000216-main.pdf","citationCount":"0","resultStr":"{\"title\":\"A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone\",\"authors\":\"Prashant D. Tailor MD ,&nbsp;Lauren A. Dalvin MD ,&nbsp;John J. Chen MD, PhD ,&nbsp;Raymond Iezzi MD ,&nbsp;Timothy W. Olsen MD ,&nbsp;Brittni A. Scruggs MD, PhD ,&nbsp;Andrew J. Barkmeier MD ,&nbsp;Sophie J. Bakri MD ,&nbsp;Edwin H. Ryan MD ,&nbsp;Peter H. Tang MD, PhD ,&nbsp;D. Wilkin. Parke III MD ,&nbsp;Peter J. Belin MD ,&nbsp;Jayanth Sridhar MD ,&nbsp;David Xu MD ,&nbsp;Ajay E. Kuriyan MD ,&nbsp;Yoshihiro Yonekawa MD ,&nbsp;Matthew R. Starr MD\",\"doi\":\"10.1016/j.xops.2024.100485\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><p>To assess the quality, empathy, and safety of expert edited large language model (LLM), human expert created, and LLM responses to common retina patient questions.</p></div><div><h3>Design</h3><p>Randomized, masked multicenter study.</p></div><div><h3>Participants</h3><p>Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.</p></div><div><h3>Methods</h3><p>Each expert created a response (Expert) and then edited a LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question along with anonymized and randomized Expert + AI, Expert, and LLM responses were evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).</p></div><div><h3>Main Outcome</h3><p>Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.</p></div><div><h3>Results</h3><p>There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (<em>P</em> &lt; 0.001, <em>P</em> &lt; 0.001) between LLM, Expert and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall while GPT-3.5 (3.75 ± 0.79) was the top performing LLM. 
For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (<em>P</em> &lt; 0.001) and empathy (<em>P</em> &lt; 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus expert-created response (<em>P</em> = 0.02). ChatGPT-4 performed similar to Expert for inappropriate content (<em>P</em> = 0.35), missing content (<em>P</em> = 0.001), extent of possible harm (<em>P</em> = 0.356), and likelihood of possible harm (<em>P</em> = 0.129).</p></div><div><h3>Conclusions</h3><p>In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.</p></div><div><h3>Financial Disclosure(s)</h3><p>Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.</p></div>\",\"PeriodicalId\":74363,\"journal\":{\"name\":\"Ophthalmology science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-02-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666914524000216/pdfft?md5=5e2ad31ae6fa487a208c371f1e37de20&pid=1-s2.0-S2666914524000216-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmology science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666914524000216\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmology science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666914524000216","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone

Objective

To assess the quality, empathy, and safety of expert-edited large language model (LLM) responses, human expert-created responses, and LLM-only responses to common retina patient questions.

Design

Randomized, masked, multicenter study.

Participants

Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.

Methods

Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to the same question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the experts who had not written a response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).

Main Outcome

Mean quality and empathy scores and, for each response type, the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content.

Results

A total of 4008 grades were collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001 and P < 0.001) between the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed the best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed fourth of 7 for quality and sixth of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses outperformed expert-created responses. Editing an LLM response also took less time than creating an expert response from scratch (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129).
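
As a rough illustration of how mean ± SD figures like those above can be derived from the 5-point ratings described in the Methods, the sketch below maps the Likert labels to numeric scores and compares groups with a nonparametric test. This is a minimal sketch under assumed conventions: the 1-5 mapping, the example ratings, and the Kruskal-Wallis test are illustrative assumptions, not details taken from this abstract.

```python
# Illustrative sketch only (not the paper's analysis code): convert 5-point
# Likert labels to numeric scores, summarize each response type, and run a
# nonparametric group comparison. The label-to-number mapping, example data,
# and choice of Kruskal-Wallis test are assumptions for demonstration.
import numpy as np
from scipy.stats import kruskal

LIKERT = {"very poor": 1, "poor": 2, "acceptable": 3, "good": 4, "very good": 5}

# Hypothetical quality ratings for three response types.
ratings = {
    "Expert": ["good", "acceptable", "good", "very good", "acceptable"],
    "Expert + AI": ["very good", "good", "good", "very good", "good"],
    "GPT-3.5": ["good", "good", "acceptable", "very good", "good"],
}

# Encode labels as numbers for each group.
scores = {group: np.array([LIKERT[label] for label in labels])
          for group, labels in ratings.items()}

# Mean ± SD per group, in the same form reported in the Results.
for group, values in scores.items():
    print(f"{group}: {values.mean():.2f} ± {values.std(ddof=1):.2f}")

# Nonparametric comparison across the three groups (assumed test choice).
statistic, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {statistic:.2f}, P = {p_value:.3f}")
```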

Conclusions

In this randomized, masked, multicenter study, LLM responses were comparable with expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.

Financial Disclosure(s)

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.
