Background: Large language models (LLMs) are increasingly used in healthcare for patient education and clinical decision support. However, systematic benchmarking in real-world clinical contexts remains limited, particularly for high-risk conditions such as hip fractures.
Objective: To evaluate and compare the performance of three state-of-the-art LLMs (DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5) in answering standardized patient questions on hip fracture management.
Methods: Thirty standardized questions covering general knowledge, diagnosis, treatment, and rehabilitation were developed by three specialists in orthopedics and traumatology. Each LLM generated responses independently. Three experienced orthopedic surgeons assessed accuracy (4-point scale) and comprehensiveness (5-point scale). Statistical analyses included Kruskal-Wallis and chi-squared tests.
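As an illustrative sketch of the statistical analysis described above (not the authors' actual code or data), the Kruskal-Wallis comparison of accuracy scores across the three models and a chi-squared test on rating-category counts could be run in Python as follows; all rating values shown are hypothetical placeholders.

```python
# Illustrative sketch only: hypothetical rating data, not the study's dataset.
from scipy.stats import kruskal, chi2_contingency

# Hypothetical per-question accuracy ratings (4-point scale) for each model.
deepseek = [4, 4, 3, 4, 4]
gemini   = [4, 3, 4, 4, 4]
chatgpt  = [4, 4, 4, 3, 4]

# Kruskal-Wallis test: do the three models' accuracy distributions differ?
h_stat, p_kw = kruskal(deepseek, gemini, chatgpt)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")

# Chi-squared test on counts of responses per rating category
# (rows: models; columns: e.g. "Excellent", "Good", "Fair"; counts are invented).
counts = [[22, 7, 1],
          [20, 9, 1],
          [21, 8, 1]]
chi2, p_chi, dof, _ = chi2_contingency(counts)
print(f"Chi-squared = {chi2:.2f}, dof = {dof}, p = {p_chi:.3f}")
```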
Results: All models demonstrated high reliability, with 96.7% of responses rated "Good" or "Excellent" and none rated "Poor." Mean accuracy scores were comparable across models, and comprehensiveness averaged 4.8/5. DeepSeek-V3-FW tended to provide longer, structured answers and performed best in general knowledge, while Gemini 2.0 Flash excelled in diagnosis and rehabilitation and produced the most concise responses. ChatGPT-4.5 offered shorter, conversational answers with similar accuracy and detail.
Conclusions: The three LLMs showed strong capabilities in delivering accurate and comprehensive information on hip fracture care, highlighting their potential as tools for patient education and clinical support. Differences in style and domain-specific strengths suggest complementary roles. Further research is needed to validate safety and integration into clinical workflows.