Wasif Bala, Hanzhou Li, John Moon, Hari Trivedi, Judy Gichoya, Patricia Balthazar
{"title":"Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports.","authors":"Wasif Bala, Hanzhou Li, John Moon, Hari Trivedi, Judy Gichoya, Patricia Balthazar","doi":"10.1067/j.cpradiol.2024.08.003","DOIUrl":null,"url":null,"abstract":"<p><strong>Rationale and objectives: </strong>Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports.</p><p><strong>Materials & methods: </strong>A randomly selected subset of 500 (250 train/250 validation) paired preliminary and final reports between 12/17/2022 and 5/22/2023 were extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was used on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent in preliminary reports. For testing, a subset of 10 reports with confirmed diagnostic errors were randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.</p><p><strong>Results: </strong>The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2 %. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4 %) favored a combination of LLM-generated and traditional feedback.</p><p><strong>Conclusion: </strong>This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.</p>","PeriodicalId":93969,"journal":{"name":"Current problems in diagnostic radiology","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current problems in diagnostic radiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1067/j.cpradiol.2024.08.003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Rationale and objectives: Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports.
Materials & methods: A randomly selected subset of 500 paired preliminary and final reports (250 training/250 validation) issued between 12/17/2022 and 5/22/2023 was extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was performed on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent from the preliminary report. For testing, a subset of 10 reports with confirmed diagnostic errors was randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.
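For illustration, a minimal sketch of the kind of report-comparison call described above is shown below, using the legacy OpenAI ChatCompletion interface that served the pinned gpt-4-0314 model. The study's actual prompt text and tuning procedure are not published in the abstract, so the prompt wording, function name, and sampling settings here are hypothetical reconstructions.

```python
# Sketch of one report-comparison call, assuming the legacy openai<1.0 SDK.
# The prompt wording below is an illustrative assumption, not the study's prompt.
import openai

openai.api_key = "YOUR_API_KEY"  # reports must be de-identified before any API call

PROMPT_TEMPLATE = (
    "You are reviewing radiology resident reports. Compare the preliminary "
    "report to the attending's final report and list each clinically important "
    "finding that appears in the final report but is absent from the "
    "preliminary report.\n\n"
    "PRELIMINARY REPORT:\n{prelim}\n\nFINAL REPORT:\n{final}"
)

def find_missed_diagnoses(prelim: str, final: str) -> str:
    """Return the model's list of findings missing from the preliminary report."""
    response = openai.ChatCompletion.create(
        model="gpt-4-0314",  # model version pinned as in the study
        messages=[
            {"role": "user",
             "content": PROMPT_TEMPLATE.format(prelim=prelim, final=final)},
        ],
        temperature=0,  # assumption: low temperature for reproducible output
    )
    return response["choices"][0]["message"]["content"]
```

In practice, "iterative prompt tuning" would amount to running a template like this over the training/validation subset, comparing the output against known discrepancies, and revising the prompt text until the flagged findings matched.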
Results: The model identified 24 unique missed diagnoses across the 10 test reports, with i% model prediction accuracy as rated by the 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2%. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and a mean perceived-accuracy rating of 3.64, each out of 5, for the LLM-generated feedback. Most respondents (71.4%) favored a combination of LLM-generated and traditional feedback.
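The abstract does not report the underlying confusion-matrix counts, but one reading consistent with the 79.2% figure (an assumption, not stated in the source) is that 19 of the 24 model-identified diagnoses were confirmed correct, with the 5 user-identified diagnoses counted as model misses:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{19}{19 + 5} \approx 79.2\%$$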
Conclusion: This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.