{"title":"Evaluation of radiology residents' reporting skills using large language models: an observational study.","authors":"Natsuko Atsukawa, Hiroyuki Tatekawa, Tatsushi Oura, Shu Matsushita, Daisuke Horiuchi, Hirotaka Takita, Yasuhito Mitsuyama, Ayako Omori, Taro Shimono, Yukio Miki, Daiju Ueda","doi":"10.1007/s11604-025-01764-y","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs) have the potential to objectively evaluate radiology resident reports; however, research on their use for feedback in radiology training and assessment of resident skill development remains limited. This study aimed to assess the effectiveness of LLMs in revising radiology reports by comparing them with reports verified by board-certified radiologists and to analyze the progression of resident's reporting skills over time.</p><p><strong>Materials and methods: </strong>To identify the LLM that best aligned with human radiologists, 100 reports were randomly selected from 7376 reports authored by nine first-year radiology residents. The reports were evaluated based on six criteria: (1) addition of missing positive findings, (2) deletion of findings, (3) addition of negative findings, (4) correction of the expression of findings, (5) correction of the diagnosis, and (6) proposal of additional examinations or treatments. Reports were segmented into four time-based terms, and 900 reports (450 CT and 450 MRI) were randomly chosen from the initial and final terms of the residents' first year. The revised rates for each criterion were compared between the first and last terms using the Wilcoxon Signed-Rank test.</p><p><strong>Results: </strong>Among the three LLMs-ChatGPT-4 Omni (GPT-4o), Claude-3.5 Sonnet, and Claude-3 Opus-GPT-4o demonstrated the highest level of agreement with board-certified radiologists. 
Significant improvements were noted in Criteria 1-3 when comparing reports from the first and last terms (Criteria 1, 2, and 3; P < 0.001, P = 0.023, and P = 0.004, respectively) using GPT-4o. No significant changes were observed for Criteria 4-6. Despite this, all criteria except for Criteria 6 showed progressive enhancement over time.</p><p><strong>Conclusion: </strong>LLMs can effectively provide feedback on commonly corrected areas in radiology reports, enabling residents to objectively identify and improve their weaknesses and monitor their progress. Additionally, LLMs may help reduce the workload of radiologists' mentors.</p>","PeriodicalId":14691,"journal":{"name":"Japanese Journal of Radiology","volume":" ","pages":"1204-1212"},"PeriodicalIF":2.1000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204868/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Japanese Journal of Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s11604-025-01764-y","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/3/8 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: Large language models (LLMs) have the potential to objectively evaluate radiology resident reports; however, research on their use for feedback in radiology training and for assessing resident skill development remains limited. This study aimed to assess the effectiveness of LLMs in revising radiology reports by comparing LLM revisions with reports verified by board-certified radiologists, and to analyze the progression of residents' reporting skills over time.
Materials and methods: To identify the LLM that best aligned with human radiologists, 100 reports were randomly selected from 7376 reports authored by nine first-year radiology residents. The reports were evaluated on six criteria: (1) addition of missing positive findings, (2) deletion of findings, (3) addition of negative findings, (4) correction of the expression of findings, (5) correction of the diagnosis, and (6) proposal of additional examinations or treatments. Reports were segmented into four time-based terms, and 900 reports (450 CT and 450 MRI) were randomly chosen from the initial and final terms of the residents' first year. Revision rates for each criterion were compared between the first and last terms using the Wilcoxon signed-rank test.
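The paired comparison described above can be sketched with SciPy's Wilcoxon signed-rank test. This is a minimal illustration, not the study's analysis code; the per-resident revision rates below are hypothetical values invented for the example, and pairing is assumed to be by resident across the two terms.

```python
from scipy.stats import wilcoxon

# Hypothetical revision rates (fraction of reports the LLM revised for one
# criterion) for nine residents in the first and last terms of the year.
first_term = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64]
last_term = [0.41, 0.38, 0.52, 0.35, 0.49, 0.44, 0.55, 0.33, 0.47]

# Paired, non-parametric test: do revision rates differ between terms?
stat, p_value = wilcoxon(first_term, last_term)
print(f"W = {stat}, p = {p_value:.4f}")
```

The Wilcoxon signed-rank test is appropriate here because the samples are paired (same residents in both terms) and revision rates need not be normally distributed; in this toy example every resident's rate decreases, so the test reports a significant difference.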
Results: Among the three LLMs evaluated (ChatGPT-4 Omni (GPT-4o), Claude-3.5 Sonnet, and Claude-3 Opus), GPT-4o demonstrated the highest level of agreement with board-certified radiologists. Using GPT-4o, significant improvements were noted in Criteria 1-3 when comparing reports from the first and last terms (P < 0.001, P = 0.023, and P = 0.004, respectively). No significant changes were observed for Criteria 4-6, although all criteria except Criterion 6 showed progressive improvement over time.
Conclusion: LLMs can effectively provide feedback on commonly corrected areas in radiology reports, enabling residents to objectively identify and improve their weaknesses and monitor their progress. Additionally, LLMs may help reduce the mentoring workload of supervising radiologists.
About the journal:
Japanese Journal of Radiology is a peer-reviewed journal, officially published by the Japan Radiological Society. The main purpose of the journal is to provide a forum for the publication of papers documenting recent advances and new developments in the field of radiology in medicine and biology. The scope of Japanese Journal of Radiology encompasses but is not restricted to diagnostic radiology, interventional radiology, radiation oncology, nuclear medicine, radiation physics, and radiation biology. Additionally, the journal covers technical and industrial innovations. The journal welcomes original articles, technical notes, review articles, pictorial essays and letters to the editor. The journal also provides announcements from the boards and the committees of the society. Membership in the Japan Radiological Society is not a prerequisite for submission. Contributions are welcomed from all parts of the world.