Objective: This study aims to evaluate the efficacy of Large Language Models (LLMs) in generating patient educational content on pediatric cardiothoracic surgical procedures.
Methods: In this comparative observational study, we employed five LLMs (ChatGPT 4o, ChatGPT 4, Google Gemini, Perplexity AI, and Claude AI) to create educational pamphlets for 24 pediatric cardiothoracic procedures. Each LLM produced three pamphlets per procedure, yielding 360 unique pamphlets (5 models × 24 procedures × 3 pamphlets). Five reviewers rated each pamphlet for accuracy and consistency using structured scoring scales, producing 1,800 accuracy evaluations and 600 consistency evaluations; patient advocates independently reviewed relevance. Readability was assessed using six metrics, as sketched below.
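The abstract does not name the six readability metrics or the scoring pipeline; the following is a minimal Python sketch assuming six widely used formulas as implemented in the `textstat` package. The metric choice and the helper `score_pamphlet` are illustrative assumptions, not the study's actual instruments.

```python
# Hypothetical readability scoring for one pamphlet; the six formulas below
# are an assumption, since the abstract does not name the metrics used.
import textstat

METRICS = {
    "Flesch Reading Ease": textstat.flesch_reading_ease,
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade,
    "Gunning Fog": textstat.gunning_fog,
    "SMOG": textstat.smog_index,
    "Coleman-Liau": textstat.coleman_liau_index,
    "Automated Readability Index": textstat.automated_readability_index,
}

def score_pamphlet(text: str) -> dict:
    """Return all six readability scores for one pamphlet's text."""
    return {name: fn(text) for name, fn in METRICS.items()}

# Study scale: 5 models x 24 procedures x 3 pamphlets = 360 pamphlets,
# each scored on all six metrics.
```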
Results: The study revealed significant differences in accuracy, with Perplexity AI performing best for cardiac procedures (p < 0.00001) and Claude AI for pulmonary procedures (p = 0.001). Consistency varied significantly across models, with ChatGPT 4 showing high variability across pamphlets. Readability analysis indicated that Gemini produced the most comprehensible content, while post-hoc analysis showed that ChatGPT 4 and Perplexity AI had broadly similar readability across the measures. Overall relevance was highest for Perplexity AI (p < 0.00001).
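The abstract reports p-values and a post-hoc analysis but does not name the statistical tests. A plausible sketch for ordinal rating-scale data compared across five groups is a Kruskal-Wallis test followed by Bonferroni-corrected pairwise Mann-Whitney U comparisons; the test choice below is an assumption for illustration.

```python
# Hypothetical model comparison; the abstract does not specify the tests,
# so Kruskal-Wallis + Bonferroni-corrected Mann-Whitney U is assumed here.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def compare_models(scores, alpha=0.05):
    """scores: dict mapping model name -> list of reviewer ratings."""
    h_stat, p = kruskal(*scores.values())
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p:.5f}")
    if p >= alpha:
        return  # no significant overall difference; skip post-hoc tests
    pairs = list(combinations(scores, 2))
    for a, b in pairs:
        _, p_pair = mannwhitneyu(scores[a], scores[b])
        p_corr = min(p_pair * len(pairs), 1.0)  # Bonferroni correction
        mark = " *" if p_corr < alpha else ""
        print(f"  {a} vs {b}: corrected p = {p_corr:.5f}{mark}")
```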
Conclusion: LLMs demonstrate significant potential for creating educational materials in pediatric cardiothoracic surgery. However, our findings suggest that their effectiveness varies with procedure type and evaluation criterion. Tailoring LLM-generated content to specific clinical contexts, combined with physician oversight, is critical. Additionally, readability should be optimized to ensure adequate comprehension by the general public.