Background: Artificial intelligence (AI) chatbots are increasingly used to provide medical information. However, systematic evaluations of their accuracy and reliability in orthopaedic surgery, particularly in total knee replacement (TKR), remain limited.
Purpose: To systematically compare and evaluate the performance of several AI chatbots, focusing on their ability to provide accurate and reliable information related to TKR.
Study design: Cohort study; Level of evidence, 2.
Methods: A total of 43 clinically relevant TKR-related frequently asked questions (FAQs) were selected based on Google search trends and expert consultation. Questions were categorized into 6 key domains: (1) general/procedure-related information, (2) indications and outcomes, (3) risks and complications, (4) pain and postoperative recovery, (5) specific activities after surgery, and (6) alternatives and variations. Each question was submitted to 5 different chatbot models (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) for response generation. Two independent orthopaedic surgeons assessed the chatbots' responses for both accuracy and relevance using a 5-point Likert scale. Responses were anonymized so that evaluators were blinded to chatbot identity, to reduce bias. Accuracy differences among the chatbot models were analyzed by analysis of variance, and relevance was compared using the Kruskal-Wallis test.
Results: GPT-3.5 (4.8 ± 0.5), GPT-4 (4.9 ± 0.4), GPT-4 Omni (4.9 ± 0.3), and Gemini 1.5 (4.8 ± 0.4) demonstrated high accuracy, whereas Gemini Advanced scored significantly lower (4.1 ± 1.4) (P < .001). By domain, accuracy did not differ significantly among chatbots for general/procedure-related information, risks and complications, pain and postoperative recovery, or specific activities after surgery. Gemini Advanced underperformed in indications and outcomes (P = .04) and in alternatives and variations (P = .002). Regarding relevance, all chatbots achieved a 100% relevance rate except Gemini Advanced (36/43; 83.7%) (P < .001).
Conclusion: This study demonstrates that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide highly accurate and relevant responses to TKR-related queries, while Gemini Advanced underperforms.
