We evaluated three multimodal LLMs, ChatGPT (GPT-5.2), Gemini 3, and Microsoft Copilot, for pediatric ECG interpretation, focusing on clinically significant abnormalities and emergency arrhythmias, with likelihood ratios as the primary outcome measures. This prospective comparative diagnostic accuracy study (reported per STARD/STARD-AI) included 264 pediatric patients with 12-lead ECGs recorded between November 2024 and November 2025. De-identified ECG images were submitted to each model via a standardized zero-shot prompt. Three blinded pediatric cardiologists established the reference diagnosis by majority-vote consensus. Cases were classified as Tier 1 (normal), Tier 2 (abnormal, non-urgent), or Tier 3 (urgent). Two binary endpoints were assessed: clinically significant abnormality (Tier 2 + 3 vs Tier 1) and emergency abnormality (Tier 3 vs Tier 1 + 2). Clinically significant abnormalities were present in 54.5% of patients. AUC values ranged from 0.550 to 0.623, reflecting modest discrimination. For the clinically significant endpoint, +LR values were 2.05 (ChatGPT), 1.26 (Gemini), and 1.21 (Copilot), and the corresponding -LR values were 0.68, 0.55, and 0.81, indicating limited rule-in and insufficient rule-out utility. For the emergency endpoint, Gemini achieved 100% sensitivity (95% CI = 85.1-100.0) with a -LR of 0.07 (95% CI = 0.00-1.12) in a small subgroup (n = 22); however, its specificity of 30.2% and +LR of 1.40 indicate overcalling rather than diagnostic precision. No model achieved clinically meaningful rule-in utility for either endpoint.
Conclusions: Current multimodal LLMs showed limited diagnostic utility in pediatric ECG interpretation, with +LR values near 1.0 across both endpoints. Standalone deployment is not supported; at most, these tools may serve as adjunctive screening aids under clinician oversight.
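For readers less familiar with the outcome measure, the likelihood ratios reported above follow directly from sensitivity and specificity: +LR = sensitivity / (1 - specificity) and -LR = (1 - sensitivity) / specificity. The following is a minimal sketch of that arithmetic, not the study's analysis code; the function name `likelihood_ratios` is hypothetical. It is applied to the sensitivity and specificity stated for Gemini on the emergency endpoint.

```python
import math
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (+LR, -LR) from sensitivity and specificity given as fractions in [0, 1]."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    # +LR: how much a positive result raises the odds of disease.
    positive_lr = sensitivity / (1.0 - specificity) if specificity < 1.0 else math.inf
    # -LR: how much a negative result lowers the odds of disease.
    negative_lr = (1.0 - sensitivity) / specificity if specificity > 0.0 else math.inf
    return positive_lr, negative_lr


# Values stated in the abstract for Gemini, emergency endpoint.
plr, nlr = likelihood_ratios(sensitivity=1.0, specificity=0.302)
print(f"+LR = {plr:.2f}, -LR = {nlr:.2f}")
```

The uncorrected formulas give a +LR of about 1.43 and a -LR of exactly 0, whereas the abstract reports 1.40 and 0.07. A difference of this kind would be expected if a continuity correction was applied to the empty false-negative cell, but the abstract does not state which method was used, so that explanation is an assumption.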
What is known:
• Deep learning algorithms trained on large ECG datasets perform well in adult populations, but evidence in pediatric ECG interpretation is limited.
• General-purpose LLMs show variable accuracy in medical examinations; reliability in subspecialty domains such as pediatric cardiology remains unproven.
What is new:
• This is the first head-to-head comparative diagnostic accuracy study of multimodal LLMs in pediatric ECG evaluation, using likelihood ratios as primary outcome measures.
• All three LLMs showed limited rule-in utility (+LR near 1.0); Gemini achieved potentially meaningful rule-out performance for emergency arrhythmias (-LR = 0.07), but with wide confidence intervals reflecting the small emergency subgroup (n = 22).
• Gemini's 100% sensitivity in the emergency subgroup reflects overcalling (specificity 30.2%), consistent with triage or screening behavior rather than diagnostic precision.
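The width of the interval around Gemini's 100% sensitivity illustrates how little a subgroup of 22 can establish. As a hedged sketch, the code below computes a Wilson score interval for 22 detected cases out of 22. The abstract does not name its interval method, so the use of Wilson here is an assumption, and `wilson_interval` is a hypothetical helper rather than the authors' code.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumed counts: all 22 emergency cases flagged (100% sensitivity, n = 22).
lower, upper = wilson_interval(successes=22, n=22)
print(f"95% CI: {lower:.1%} to {upper:.1%}")
```

Under this assumption the lower bound comes out at about 85.1%, which agrees with the interval quoted in the abstract; even with no missed cases, a true sensitivity in the mid-80s cannot be excluded.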
