Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin YM Kwan
{"title":"区分gpt生成的和人为写的放射住院医师反馈。","authors":"Zier Zhou , Arsalan Rizwan , Nick Rogoza , Andrew D Chung , Benjamin YM Kwan","doi":"10.1067/j.cpradiol.2025.02.002","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.</div></div><div><h3>Methods</h3><div>Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019–2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.</div></div><div><h3>Results</h3><div>Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (<em>k</em>=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).</div></div><div><h3>Conclusion</h3><div>Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.</div></div>","PeriodicalId":51617,"journal":{"name":"Current Problems in Diagnostic Radiology","volume":"54 5","pages":"Pages 574-578"},"PeriodicalIF":1.5000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Differentiating between GPT-generated and human-written feedback for radiology residents\",\"authors\":\"Zier Zhou , Arsalan Rizwan , Nick Rogoza , Andrew D Chung , Benjamin YM Kwan\",\"doi\":\"10.1067/j.cpradiol.2025.02.002\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><div>Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.</div></div><div><h3>Methods</h3><div>Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019–2023). 
Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.</div></div><div><h3>Results</h3><div>Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (<em>k</em>=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).</div></div><div><h3>Conclusion</h3><div>Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.</div></div>\",\"PeriodicalId\":51617,\"journal\":{\"name\":\"Current Problems in Diagnostic Radiology\",\"volume\":\"54 5\",\"pages\":\"Pages 574-578\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Current Problems in Diagnostic Radiology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0363018825000465\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/18 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Problems in Diagnostic Radiology","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0363018825000465","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/18 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Differentiating between GPT-generated and human-written feedback for radiology residents
Purpose
Recent implementation of competency-based medical education (CBME) within Canadian radiology programs has required faculty to conduct more assessments. The growing role of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about whether these models can generate informative comments comparable to those of human experts, and about the challenges of using them for this purpose. This study compares human-written and GPT-3.5-generated feedback for radiology residents and examines how well raters can differentiate between the two sources.
Methods
Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019–2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. Eleven of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with the actual comments. Two faculty raters and GPT-3.5 read each comment and predicted whether it was human-written or GPT-generated.
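For readers unfamiliar with this kind of blinded comparison, the Python sketch below shows one way such a comment pool could be assembled: de-identified human comments are sampled, combined with GPT-generated comments, and shuffled so that raters see them without source labels. The file names, column name, and sampling helper are illustrative assumptions rather than artifacts of the study, and the synthetic comments are assumed to have been produced separately by prompting GPT-3.5.

import csv
import random

def load_comments(path):
    # Load de-identified, sentence-level comments from a CSV assumed to have a 'comment' column.
    with open(path, newline="", encoding="utf-8") as f:
        return [row["comment"].strip() for row in csv.DictReader(f) if row["comment"].strip()]

def build_blinded_pool(human_comments, synthetic_comments, n_human=110, seed=42):
    # Randomly sample human comments, combine them with synthetic ones, and shuffle.
    # Each item keeps a hidden 'source' label so rater predictions can be scored afterwards.
    rng = random.Random(seed)
    sampled = rng.sample(human_comments, n_human)
    pool = [{"text": c, "source": "human"} for c in sampled] + \
           [{"text": c, "source": "gpt"} for c in synthetic_comments]
    rng.shuffle(pool)
    return pool

if __name__ == "__main__":
    # Hypothetical file names; synthetic comments are assumed to have been generated beforehand.
    humans = load_comments("human_comments.csv")
    synthetic = load_comments("synthetic_comments.csv")
    pool = build_blinded_pool(humans, synthetic)
    print(f"Blinded pool of {len(pool)} comments ready for rating.")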
Results
Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Agreement between the source judgments of GPT-3.5 and the human raters was low (κ = -0.237). Human raters were also more accurate (80.5%) at distinguishing actual from synthetic comments than GPT-3.5 (50%).
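Both reported quantities, rater accuracy and Cohen's kappa, can be computed directly from the source predictions once the true labels are revealed. The short Python sketch below uses invented toy labels, not the study's data, to show the calculation. Kappa is defined as (p_o - p_e) / (1 - p_e), observed agreement corrected for chance agreement, so a negative value such as the reported -0.237 means the two raters agreed less often than chance alone would predict.

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Cohen's kappa for two raters over the same items: (p_o - p_e) / (1 - p_e).
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n          # observed agreement
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def accuracy(predictions, truth):
    # Fraction of comments whose source was identified correctly.
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

# Toy illustration only; these labels are invented, not taken from the study.
truth       = ["human", "gpt", "human", "gpt", "human", "gpt"]
human_rater = ["human", "gpt", "human", "human", "human", "gpt"]
gpt_rater   = ["gpt", "human", "gpt", "gpt", "human", "human"]

print("human rater accuracy:", accuracy(human_rater, truth))
print("GPT-3.5 accuracy:", accuracy(gpt_rater, truth))
print("inter-rater agreement (kappa):", round(cohen_kappa(human_rater, gpt_rater), 3))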
Conclusion
Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared with human raters, GPT-3.5 also performs worse at distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms that produce higher-quality feedback and support faculty development.
Journal description:
Current Problems in Diagnostic Radiology covers important and controversial topics in radiology. Each issue presents important viewpoints from leading radiologists. High-quality reproductions of radiographs, CT scans, MR images, and sonograms clearly depict what is being described in each article. Also included are valuable updates relevant to other areas of practice, such as medical-legal issues or archiving systems. With its multi-topic format and image-intensive style, Current Problems in Diagnostic Radiology offers an outstanding, time-saving overview of the current topics most relevant to radiologists.