Objective
To evaluate whether psychiatric discharge summaries (DS) generated with ChatGPT-4 from electronic health records (EHR) can match the quality of DS written by psychiatric residents.
Methods
At a psychiatric primary care hospital, we compared 20 inpatient DS written by residents with DS generated by ChatGPT-4 from the pseudonymized residents’ notes in the patients’ EHRs and a standardized prompt. Eight blinded psychiatry specialists rated both versions on a custom Likert scale from 1 to 5 across 15 quality subcategories. The primary outcome was the overall rating difference between the two groups. The secondary outcomes were the rating differences at the level of individual questions, cases, and raters.
Results
Human-written DS were rated significantly higher than AI-DS (mean ratings: human 3.78, AI 3.12, p < 0.05). They surpassed AI-DS significantly in 12/15 questions and 16/20 cases and were favored significantly by 7/8 raters. For “low expected correction effort”, human DS were rated as 67 % favorable, 19 % neutral, and 14 % unfavorable, whereas AI-DS were rated as 22 % favorable, 33 % neutral, and 45 % unfavorable. Hallucinations were present in 40 % of AI-DS, with 37.5 % of these deemed highly clinically relevant. Minor content mistakes were found in 30 % of AI-DS and 10 % of human DS. Raters correctly identified AI-DS with 81 % sensitivity and 75 % specificity.
Discussion
Overall, AI-DS did not match the quality of resident-written DS but performed similarly in 20 % of cases and were rated as favorable for “low expected correction effort” in 22 % of ratings. AI-DS were weakest in content specificity, ability to distill key case information, and coherence but performed adequately in conciseness, adherence to formalities, relevance of included content, and form.
Conclusion
LLM-written DS show potential as templates for physicians to finalize, which could save time in the future.