Objectives
Artificial intelligence (AI) has transformed public access to information, with large language model (LLM)-based chatbots allowing users to receive comprehensive, individualized responses. In this study, we aimed to evaluate the quality of LLM responses to questions about common orthopedic conditions. We hypothesized that both ChatGPT and Gemini would produce high-quality, evidence-based responses across all evaluation criteria.
Methods
Responses from ChatGPT and Gemini to prompts based on the 14 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines for clavicle fracture management were evaluated on six criteria by seven fellowship-trained shoulder and trauma orthopedic surgeons. Mean scores and standard deviations were calculated, and two-sided t-tests were used to compare performance between ChatGPT and Gemini. Scores were then assessed for inter-rater reliability (IRR).
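For illustration, a minimal sketch of the comparison described above (means, standard deviations, and a two-sided t-test) is shown below. The score arrays are placeholder values, not study data; the pairing of scores by prompt is an assumption, and the IRR step is omitted.

```python
# Hypothetical sketch of the scoring comparison; placeholder data only.
import numpy as np
from scipy import stats

# Assumed layout: 7 raters x 14 guideline-based prompts per model,
# scored on a 1-5 scale for a single criterion.
rng = np.random.default_rng(0)
chatgpt = rng.uniform(3.5, 5.0, size=(7, 14))  # placeholder ratings
gemini = rng.uniform(3.0, 5.0, size=(7, 14))   # placeholder ratings

# Mean +/- standard deviation for each model.
print(f"ChatGPT: {chatgpt.mean():.2f} +/- {chatgpt.std(ddof=1):.2f}")
print(f"Gemini:  {gemini.mean():.2f} +/- {gemini.std(ddof=1):.2f}")

# Two-sided t-test on per-prompt mean scores. Whether the study paired
# scores by prompt is not stated; a paired test is assumed here.
t_stat, p_value = stats.ttest_rel(chatgpt.mean(axis=0), gemini.mean(axis=0))
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```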
Results
Both ChatGPT and Gemini demonstrated overall mean scores greater than 3.5. ChatGPT's mean overall score was highest in evidence-based (4.52 ± 0.16) and lowest in clarity (4.22 ± 0.19). Gemini's mean overall score was highest in clarity (4.31 ± 0.17) and lowest in evidence-based (3.81 ± 0.22). ChatGPT performed significantly better than Gemini in overall completeness (4.50 ± 0.17 vs. 4.11 ± 0.19, p < 0.005); scores were otherwise not significantly different. Over 70% of reviewers rated ChatGPT's responses as higher quality than Gemini's.
Conclusions
ChatGPT and Gemini produced responses that were generally in line with the 2022 AAOS guidelines on the treatment of clavicle fractures. Scores were comparable in every overall category except completeness, with ChatGPT outperforming Gemini. These results suggest that both LLMs are capable of providing clinically relevant responses to questions related to clavicle fracture management.
