Differentiating between GPT-generated and human-written feedback for radiology residents

Current Problems in Diagnostic Radiology (IF 1.5, Q3, Radiology, Nuclear Medicine & Medical Imaging) · Published 2025-09-01 · Epub 2025-02-18 · DOI: 10.1067/j.cpradiol.2025.02.002
Zier Zhou , Arsalan Rizwan , Nick Rogoza , Andrew D Chung , Benjamin YM Kwan
Volume 54, Issue 5, Pages 574-578
Cited by: 0

Abstract

Differentiating between GPT-generated and human-written feedback for radiology residents

Purpose

Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises the question of whether these models can generate comments as informative as those of human experts, and what challenges that would involve. This study compares human-written feedback with GPT-3.5-generated feedback for radiology residents and examines how well raters can differentiate between the two sources.

Methods

Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019–2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. Eleven of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with the actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.
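The study's blinded-rating setup (mix real and synthetic comments, then score each rater's source predictions against the ground truth) can be sketched as follows. This is not the authors' code; the helper names and toy comments are illustrative stand-ins for the 110 actual and 110 synthetic comments described above.

```python
import random

def build_blinded_pool(actual, synthetic, seed=0):
    """Mix human-written and GPT-generated comments into one shuffled,
    blinded pool, keeping the ground-truth label alongside each comment."""
    pool = [(text, "human") for text in actual] + [(text, "gpt") for text in synthetic]
    random.Random(seed).shuffle(pool)
    return pool

def rater_accuracy(predictions, pool):
    """Fraction of comments whose predicted source matches the ground truth."""
    return sum(p == label for p, (_, label) in zip(predictions, pool)) / len(pool)

# Hypothetical toy data standing in for the study's comment pool.
actual = ["Protocoled CT chest studies independently and accurately.",
          "Built strong rapport with patients during fluoroscopy."]
synthetic = ["Good job overall.", "Keep working on your reports."]

pool = build_blinded_pool(actual, synthetic)
perfect_preds = [label for _, label in pool]  # an always-correct rater, for illustration
print(rater_accuracy(perfect_preds, pool))    # 1.0 by construction
```

A real rater would of course see only the comment text, not the label; the design choice here is simply to carry the hidden label alongside each comment so accuracy can be scored after the fact.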

Results

Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (k = -0.237) was observed between the responses provided by GPT-3.5 and the human raters. Human raters were also more accurate (80.5%) than GPT-3.5 (50%) at identifying actual and synthetic comments.
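The agreement statistic reported above is Cohen's kappa, which compares observed agreement to the agreement expected by chance; a negative value, such as the k = -0.237 reported here, means the two raters agreed less often than chance would predict. A minimal pure-Python sketch (not the authors' code; the label sequences are illustrative):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters' label sequences (pure Python)."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n                      # observed agreement
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Illustrative only: systematic disagreement drives kappa below zero,
# i.e. worse than chance, as with the negative kappa reported in the study.
human_rater = ["human", "human", "gpt", "gpt"]
gpt35_rater = ["gpt", "gpt", "human", "human"]
print(cohen_kappa(human_rater, gpt35_rater))  # -1.0 for complete disagreement
```

In practice one would use a tested implementation such as `sklearn.metrics.cohen_kappa_score`; the hand-rolled version is shown only to make the chance-correction explicit.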

Conclusion

Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.
Source journal: Current Problems in Diagnostic Radiology (Radiology, Nuclear Medicine & Medical Imaging)
CiteScore: 3.00 · Self-citation rate: 0.00% · Articles per year: 113 · Review time: 46 days
Journal description: Current Problems in Diagnostic Radiology covers important and controversial topics in radiology. Each issue presents important viewpoints from leading radiologists. High-quality reproductions of radiographs, CT scans, MR images, and sonograms clearly depict what is being described in each article. Also included are valuable updates relevant to other areas of practice, such as medical-legal issues or archiving systems. With its multi-topic format and image-intensive style, Current Problems in Diagnostic Radiology offers an outstanding, time-saving investigation into the current topics most relevant to radiologists.
Latest articles from this journal:
Coronary computed tomography angiography without ECG leads: A feasibility study
Defining CT subtypes in chronic obstructive pulmonary disease: real world daily practice does not meet guidelines
Chest CT findings of follicular bronchiolitis: Comparative analysis according to underlying lung diseases
Immunotherapy-induced pulmonary toxicity: A comprehensive radiological review
The PE puzzle: Identifying and differentiating mimics of acute and chronic pulmonary embolism on CTPA