Background: To compare the quality and clinical usefulness of large language model (LLM)-generated lumbar spine magnetic resonance imaging (MRI) reports with radiologist-written ones and assess whether medical professionals can distinguish between them.
Materials and methods: This retrospective observational single-center study was approved by the local ethics committee. A total of 125 lumbar spine MRI reports (104 human-written, 21 LLM-generated using ChatGPT-4o) were anonymized, randomized, and blindly evaluated by five medical professionals (one board-certified radiologist, two radiology residents, one general practitioner, and one orthopedic surgeon), all with basic familiarity with LLMs. Each report was scored on a five-point Likert scale for clinical relevance, clarity, completeness, diagnostic accuracy, and intelligibility; the general practitioner and the orthopedic surgeon evaluated intelligibility only. Evaluators also classified each report as AI-generated or human-written. Accuracy was defined as the proportion of correctly classified reports in distinguishing LLM-generated from radiologist-written texts. Mann-Whitney U or Student's t-tests were used.
Results: Radiologists' reports consistently received higher median scores across all domains (p < 0.001). No differences were found in the description of the imaging technique (p > 0.175). No clinically false statements were identified in the LLM-generated reports. Identification accuracy varied widely among evaluators: the board-certified radiologist achieved 88.0% accuracy (sensitivity 66.7%, specificity 92.3%), Resident 1 65.6% (14.3%, 76.0%), Resident 2 94.4% (66.7%, 100%), the orthopedic surgeon 78.4% (90.5%, 76.0%), and the general practitioner 65.6% (81.0%, 62.5%).
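The reported accuracies can be related to the sensitivity and specificity values through the report counts given in the methods (21 LLM-generated, 104 human-written). The following is a minimal sketch of that arithmetic, not the authors' analysis code. It assumes that the LLM-generated reports are the "positive" class, so that sensitivity is the share of LLM-generated reports correctly identified; the abstract does not state this explicitly. The function name `accuracy_from_rates` and the dictionary `evaluators` are illustrative.

```python
# Report counts taken from the abstract.
N_LLM = 21      # LLM-generated reports (ASSUMED to be the "positive" class)
N_HUMAN = 104   # radiologist-written reports


def accuracy_from_rates(sensitivity, specificity, n_pos=N_LLM, n_neg=N_HUMAN):
    """Overall accuracy implied by sensitivity and specificity (as fractions)."""
    # Recover whole-report counts from the rounded percentages.
    true_pos = round(sensitivity * n_pos)
    true_neg = round(specificity * n_neg)
    return (true_pos + true_neg) / (n_pos + n_neg)


# (sensitivity, specificity) per evaluator, as reported in the abstract.
evaluators = {
    "Board-certified radiologist": (0.667, 0.923),
    "Resident 1": (0.143, 0.760),
    "Resident 2": (0.667, 1.000),
    "Orthopedic surgeon": (0.905, 0.760),
    "General practitioner": (0.810, 0.625),
}

for name, (sens, spec) in evaluators.items():
    print(f"{name}: {100 * accuracy_from_rates(sens, spec):.1f}%")
```

Under this assumption, the recomputed values agree with the accuracies reported above. Because only 21 of the 125 reports are LLM-generated, accuracy is dominated by specificity, which is why Resident 1 reaches 65.6% accuracy despite a sensitivity of 14.3%.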
Conclusion: Radiologist-written lumbar spine MRI reports outperform LLM-generated reports in quality and structure. However, some AI-generated reports were indistinguishable from human ones, particularly for non-specialized readers. LLMs may support radiologists in structured reporting and improve workflow efficiency, while maintaining diagnostic reliability.
Relevance statement: Large language models can draft lumbar spine MRI reports, but currently lack the quality and consistency of radiologist reports. With radiologist supervision, large language models may improve reporting efficiency while preserving diagnostic reliability and supporting clinical decision-making.
Key points: LLM-generated reports are clinically coherent and stylistically comparable to those written by expert radiologists. Radiologist-written reports scored significantly higher for clinical relevance, findings, and structure. LLM-generated reports were sometimes misclassified as human-written by clinicians.
