Objective: Preventable errors in clinical documentation and decision-making remain a major threat to patient safety, yet the role of open-source large language models (LLMs) as practical "second reviewers" in general Internal Medicine is unclear.
Methods: We prospectively assembled 102 real-world Emergency Internal Medicine reports (de-identified) and either inserted or confirmed realistic errors across four categories (diagnostics/investigations, medication/therapy, process/communication/follow-up, other). Three LLMs (open-source Deepseek-v3-r1 and GPT-OSS-120b, and closed-source OpenAI-o3) were prompted with a uniform system instruction to (i) localize the predefined error and (ii) recommend corrections. Two blinded Internal Medicine specialists independently graded outputs for error localization (0-1) and recommendation quality (Likert 1-4); disagreements were resolved analytically, and analyses used the more conservative rater. Three human clinicians independently reviewed subsets of the same cases to provide a comparator.
Results: Using the conservative rater, correct error localization was 72.5% (74/102; 95% CI 63.2-80.3) for Deepseek-v3-r1, 79.2% (80/101; 95% CI 70.3-86.0) for o3, and 65.7% (67/102; 95% CI 56.1-74.2) for GPT-OSS-120b (Cochran's Q p = 0.033). Pairwise McNemar tests favored o3 over GPT-OSS-120b (unadjusted p = 0.020), although this contrast did not remain significant after Holm adjustment (p = 0.060); no other contrast was significant. Recommendation quality was high for all models (median 4/4), with mean ± SD scores of 3.73 ± 0.49 for Deepseek-v3-r1, 3.65 ± 0.64 for o3, and 3.51 ± 0.73 for GPT-OSS-120b. Inter-rater agreement was excellent for GPT-OSS-120b (κ = 0.94 for detection; κ_w = 0.85 for quality), substantial for Deepseek-v3-r1 (κ = 0.75; κ_w = 0.47), and lower for o3 (κ = 0.31; κ_w = 0.14). All models frequently flagged additional clinically useful issues (≥99% of reports).
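The reported 95% confidence intervals are numerically consistent with Wilson score intervals for the stated counts. As an illustrative sketch (the abstract does not name the interval method, so Wilson is an assumption on our part), they can be reproduced with a few lines of standard-library Python:

```python
from math import sqrt

def wilson_ci(k, n, z=1.959964):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half) / denom, (centre + half) / denom

# Localization counts reported in Results (successes, total graded reports)
for name, k, n in [("Deepseek-v3-r1", 74, 102),
                   ("o3", 80, 101),
                   ("GPT-OSS-120b", 67, 102)]:
    lo, hi = wilson_ci(k, n)
    print(f"{name}: {100 * k / n:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

Running this prints 72.5% (63.2-80.3), 79.2% (70.3-86.0), and 65.7% (56.1-74.2), matching the values above; the pairwise McNemar and Holm-adjusted p-values cannot be reproduced from the abstract alone, since they depend on the per-case discordant pairs.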
Conclusion: In real-world Internal Medicine reports with realistic, expert-defined errors, state-of-the-art open-source LLMs approached the performance of a leading closed model and clearly outperformed clinicians in error detection, while providing predominantly guideline-concordant corrective recommendations. Given their advantages for privacy, customizability, and potential local deployment, open models represent credible candidates for privacy-preserving "second-reviewer" support in Internal Medicine. Prospective, workflow-embedded trials that also quantify specificity on error-free notes, alert burden, and patient outcomes are now warranted.