Performance of ChatGPT Compared to Clinical Practice Guidelines in Making Informed Decisions for Lumbosacral Radicular Pain: A Cross-sectional Study

Silvia Gianola, Silvia Bargeri, Greta Castellini, Chad Cook, Alvisa Palese, Paolo Pillastrini, Silvia Salvalaggio, Andrea Turolla, Giacomo Rossettini

Journal of Orthopaedic & Sports Physical Therapy, 2024;54(3):222-228. Published 1 March 2024 (Epub 29 January 2024). doi:10.2519/jospt.2024.12151
Citations: 0
Abstract
OBJECTIVE: To compare the accuracy of an artificial intelligence chatbot against clinical practice guideline (CPG) recommendations in answering complex clinical questions on lumbosacral radicular pain.

DESIGN: Cross-sectional study.

METHODS: We extracted recommendations from recent CPGs for diagnosing and treating lumbosacral radicular pain. Related clinical questions were developed and posed to OpenAI's ChatGPT (GPT-3.5). We compared ChatGPT answers to CPG recommendations by assessing (1) the internal consistency of ChatGPT answers, measured as the percentage of text wording similarity when a clinical question was posed 3 times; (2) the reliability between 2 independent reviewers in grading ChatGPT answers; and (3) the accuracy of ChatGPT answers compared to CPG recommendations. Reliability was estimated using Fleiss' kappa (κ) coefficients, and accuracy by interobserver agreement, calculated as the frequency of agreement among all judgments.

RESULTS: We tested 9 clinical questions. The internal consistency of ChatGPT's text answers was unacceptable across all 3 trials for all clinical questions (mean wording similarity, 49%; standard deviation, 15%). Intrarater reliability (reviewer 1: κ = 0.90, standard error [SE] = 0.09; reviewer 2: κ = 0.90, SE = 0.10) and interrater reliability (κ = 0.85, SE = 0.15) between the 2 reviewers were "almost perfect." Accuracy of ChatGPT answers against CPG recommendations was slight, with agreement for 33% of recommendations.

CONCLUSION: ChatGPT performed poorly in the internal consistency and accuracy of the indications it generated compared with clinical practice guideline recommendations for lumbosacral radicular pain. J Orthop Sports Phys Ther 2024;54(3):1-7. Epub 29 January 2024. doi:10.2519/jospt.2024.12151.
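The abstract relies on two quantitative measures: the percentage of text wording similarity across repeated answers to the same question, and Fleiss' kappa for reviewer agreement. The study does not publish its analysis code, so the sketch below is only an illustration of how such metrics could be computed in Python; the choice of difflib's SequenceMatcher ratio as the similarity measure, the helper name mean_pairwise_similarity, and all example answers and gradings are assumptions, not the authors' method or data.

```python
# Minimal sketch (assumed implementation, not the study's code) of the two
# headline metrics: wording similarity across repeated answers and
# Fleiss' kappa over reviewer gradings.
from itertools import combinations
from difflib import SequenceMatcher

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def mean_pairwise_similarity(answers):
    """Mean percentage of wording similarity over all pairs of repeated answers."""
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(answers, 2)]
    return 100 * sum(ratios) / len(ratios)


# Three hypothetical ChatGPT answers to the same clinical question (made up).
answers = [
    "Imaging is not routinely recommended for lumbosacral radicular pain.",
    "Routine imaging is generally not advised unless red flags are present.",
    "MRI should be considered only when serious pathology is suspected.",
]
print(f"internal consistency: {mean_pairwise_similarity(answers):.0f}%")

# Hypothetical gradings of 9 answers (rows = clinical questions, columns = the
# 2 reviewers; 1 = consistent with the CPG recommendation, 0 = not consistent).
# Values are invented to show the API calls, not taken from the study.
gradings = np.array([
    [1, 1], [0, 0], [1, 0], [0, 0], [1, 1],
    [0, 1], [0, 0], [1, 1], [0, 0],
])
counts, _ = aggregate_raters(gradings)  # subjects x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")
```

Under this reading, a mean pairwise similarity near 49% across the three repetitions of each question would be judged unacceptable, and a kappa of 0.85-0.90 would fall in the "almost perfect" band of the conventional interpretation scale.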
Journal introduction:
The Journal of Orthopaedic & Sports Physical Therapy® (JOSPT®) publishes scientifically rigorous, clinically relevant content for physical therapists and others in the health care community to advance musculoskeletal and sports-related practice globally. To this end, JOSPT features the latest evidence-based research and clinical cases in musculoskeletal health, injury, and rehabilitation, including physical therapy, orthopaedics, sports medicine, and biomechanics.
With an impact factor of 3.090, JOSPT is among the highest ranked physical therapy journals in Clarivate Analytics's Journal Citation Reports, Science Edition (2017). JOSPT stands eighth of 65 journals in the category of rehabilitation, twelfth of 77 journals in orthopedics, and fourteenth of 81 journals in sport sciences. JOSPT's 5-year impact factor is 4.061.