Performance of ChatGPT Compared to Clinical Practice Guidelines in Making Informed Decisions for Lumbosacral Radicular Pain: A Cross-sectional Study

IF 6.0 | CAS Tier 1 (Medicine) | JCR Q1 (Orthopedics) | Journal of Orthopaedic & Sports Physical Therapy | Pub Date: 2024-03-01 | DOI: 10.2519/jospt.2024.12151
Silvia Gianola, Silvia Bargeri, Greta Castellini, Chad Cook, Alvisa Palese, Paolo Pillastrini, Silvia Salvalaggio, Andrea Turolla, Giacomo Rossettini
{"title":"在对腰骶椎痛做出知情决定时,ChatGPT 与临床实践指南的性能比较:一项横断面研究。","authors":"Silvia Gianola, Silvia Bargeri, Greta Castellini, Chad Cook, Alvisa Palese, Paolo Pillastrini, Silvia Salvalaggio, Andrea Turolla, Giacomo Rossettini","doi":"10.2519/jospt.2024.12151","DOIUrl":null,"url":null,"abstract":"<p><p><b>OBJECTIVE:</b> To compare the accuracy of an artificial intelligence chatbot to clinical practice guidelines (CPGs) recommendations for providing answers to complex clinical questions on lumbosacral radicular pain. <b>DESIGN:</b> Cross-sectional study. <b>METHODS:</b> We extracted recommendations from recent CPGs for diagnosing and treating lumbosacral radicular pain. Relative clinical questions were developed and queried to OpenAI's ChatGPT (GPT-3.5). We compared ChatGPT answers to CPGs recommendations by assessing the (1) internal consistency of ChatGPT answers by measuring the percentage of text wording similarity when a clinical question was posed 3 times, (2) reliability between 2 independent reviewers in grading ChatGPT answers, and (3) accuracy of ChatGPT answers compared to CPGs recommendations. Reliability was estimated using Fleiss' kappa (κ) coefficients, and accuracy by interobserver agreement as the frequency of the agreements among all judgments. <b>RESULTS:</b> We tested 9 clinical questions. The internal consistency of text ChatGPT answers was unacceptable across all 3 trials in all clinical questions (mean percentage of 49%, standard deviation of 15). Intrareliability (reviewer 1: κ = 0.90, standard error [SE] = 0.09; reviewer 2: κ = 0.90, SE = 0.10) and interreliability (κ = 0.85, SE = 0.15) between the 2 reviewers was \"almost perfect.\" Accuracy between ChatGPT answers and CPGs recommendations was slight, demonstrating agreement in 33% of recommendations. <b>CONCLUSION:</b> ChatGPT performed poorly in internal consistency and accuracy of the indications generated compared to clinical practice guideline recommendations for lumbosacral radicular pain. <i>J Orthop Sports Phys Ther 2024;54(3):1-7. Epub 29 January 2024. doi:10.2519/jospt.2024.12151</i>.</p>","PeriodicalId":50099,"journal":{"name":"Journal of Orthopaedic & Sports Physical Therapy","volume":" ","pages":"222-228"},"PeriodicalIF":6.0000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of ChatGPT Compared to Clinical Practice Guidelines in Making Informed Decisions for Lumbosacral Radicular Pain: A Cross-sectional Study.\",\"authors\":\"Silvia Gianola, Silvia Bargeri, Greta Castellini, Chad Cook, Alvisa Palese, Paolo Pillastrini, Silvia Salvalaggio, Andrea Turolla, Giacomo Rossettini\",\"doi\":\"10.2519/jospt.2024.12151\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>OBJECTIVE:</b> To compare the accuracy of an artificial intelligence chatbot to clinical practice guidelines (CPGs) recommendations for providing answers to complex clinical questions on lumbosacral radicular pain. <b>DESIGN:</b> Cross-sectional study. <b>METHODS:</b> We extracted recommendations from recent CPGs for diagnosing and treating lumbosacral radicular pain. Relative clinical questions were developed and queried to OpenAI's ChatGPT (GPT-3.5). 
We compared ChatGPT answers to CPGs recommendations by assessing the (1) internal consistency of ChatGPT answers by measuring the percentage of text wording similarity when a clinical question was posed 3 times, (2) reliability between 2 independent reviewers in grading ChatGPT answers, and (3) accuracy of ChatGPT answers compared to CPGs recommendations. Reliability was estimated using Fleiss' kappa (κ) coefficients, and accuracy by interobserver agreement as the frequency of the agreements among all judgments. <b>RESULTS:</b> We tested 9 clinical questions. The internal consistency of text ChatGPT answers was unacceptable across all 3 trials in all clinical questions (mean percentage of 49%, standard deviation of 15). Intrareliability (reviewer 1: κ = 0.90, standard error [SE] = 0.09; reviewer 2: κ = 0.90, SE = 0.10) and interreliability (κ = 0.85, SE = 0.15) between the 2 reviewers was \\\"almost perfect.\\\" Accuracy between ChatGPT answers and CPGs recommendations was slight, demonstrating agreement in 33% of recommendations. <b>CONCLUSION:</b> ChatGPT performed poorly in internal consistency and accuracy of the indications generated compared to clinical practice guideline recommendations for lumbosacral radicular pain. <i>J Orthop Sports Phys Ther 2024;54(3):1-7. Epub 29 January 2024. doi:10.2519/jospt.2024.12151</i>.</p>\",\"PeriodicalId\":50099,\"journal\":{\"name\":\"Journal of Orthopaedic & Sports Physical Therapy\",\"volume\":\" \",\"pages\":\"222-228\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Orthopaedic & Sports Physical Therapy\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2519/jospt.2024.12151\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Orthopaedic & Sports Physical Therapy","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2519/jospt.2024.12151","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0

Abstract

OBJECTIVE: To compare the accuracy of an artificial intelligence chatbot to clinical practice guideline (CPG) recommendations for providing answers to complex clinical questions on lumbosacral radicular pain.

DESIGN: Cross-sectional study.

METHODS: We extracted recommendations from recent CPGs for diagnosing and treating lumbosacral radicular pain. Related clinical questions were developed and posed to OpenAI's ChatGPT (GPT-3.5). We compared ChatGPT answers to CPG recommendations by assessing (1) the internal consistency of ChatGPT answers, measured as the percentage of text wording similarity when a clinical question was posed 3 times; (2) the reliability between 2 independent reviewers in grading ChatGPT answers; and (3) the accuracy of ChatGPT answers compared to CPG recommendations. Reliability was estimated using Fleiss' kappa (κ) coefficients, and accuracy by interobserver agreement, defined as the frequency of agreement among all judgments.

RESULTS: We tested 9 clinical questions. The internal consistency of ChatGPT's text answers was unacceptable across all 3 trials for all clinical questions (mean, 49%; standard deviation, 15). Intrarater reliability (reviewer 1: κ = 0.90, standard error [SE] = 0.09; reviewer 2: κ = 0.90, SE = 0.10) and interrater reliability between the 2 reviewers (κ = 0.85, SE = 0.15) were "almost perfect." Agreement between ChatGPT answers and CPG recommendations was slight, with agreement in 33% of recommendations.

CONCLUSION: ChatGPT performed poorly in the internal consistency and accuracy of the indications it generated compared with clinical practice guideline recommendations for lumbosacral radicular pain.

J Orthop Sports Phys Ther 2024;54(3):1-7. Epub 29 January 2024. doi:10.2519/jospt.2024.12151.
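The abstract leans on two quantitative tools: a text-wording-similarity percentage for internal consistency and Fleiss' kappa for reviewer reliability. The sketch below illustrates both on made-up data. The paper does not specify which similarity algorithm it used, so the difflib-based measure, the sample answers, and the grading table here are illustrative assumptions, not the authors' actual method or data.

```python
# Minimal sketch of the two metrics described in the abstract.
# ASSUMPTIONS: difflib's SequenceMatcher stands in for the paper's
# unspecified wording-similarity measure; all data below are invented.
from difflib import SequenceMatcher
from itertools import combinations

def wording_similarity(answers):
    """Mean pairwise wording similarity (0-100%) across repeated answers."""
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(answers, 2)]
    return 100 * sum(ratios) / len(ratios)

def fleiss_kappa(table):
    """Fleiss' kappa; table[i][j] = raters assigning item i to category j
    (every item rated by the same number of raters)."""
    N = len(table)             # items (here: graded ChatGPT answers)
    n = sum(table[0])          # raters per item (here: 2 reviewers)
    k = len(table[0])          # rating categories
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N               # mean observed agreement
    P_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical: one clinical question posed 3 times to the chatbot.
answers = [
    "Imaging is not routinely recommended for lumbosacral radicular pain.",
    "Routine imaging is not advised; reserve it for cases with red flags.",
    "MRI should be ordered for every patient at the first visit.",
]
print(f"internal consistency: {wording_similarity(answers):.0f}%")

# Hypothetical: 9 answers graded by 2 reviewers as accurate/inaccurate.
grades = [[2, 0], [0, 2], [2, 0], [1, 1], [0, 2],
          [2, 0], [0, 2], [2, 0], [1, 1]]
print(f"interrater Fleiss' kappa: {fleiss_kappa(grades):.2f}")
```

On this reading, the study's headline figures map directly onto these metrics: a mean similarity of 49% across the 3 repetitions, interrater κ = 0.85 between the 2 reviewers, and 33% agreement between ChatGPT answers and the CPG recommendations.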

Source Journal: Journal of Orthopaedic & Sports Physical Therapy
CiteScore: 8.00
Self-citation rate: 4.90%
Articles per year: 101
Review turnaround: 6-12 weeks
About the Journal: The Journal of Orthopaedic & Sports Physical Therapy® (JOSPT®) publishes scientifically rigorous, clinically relevant content for physical therapists and others in the health care community to advance musculoskeletal and sports-related practice globally. To this end, JOSPT features the latest evidence-based research and clinical cases in musculoskeletal health, injury, and rehabilitation, including physical therapy, orthopaedics, sports medicine, and biomechanics. With an impact factor of 3.090, JOSPT is among the highest-ranked physical therapy journals in Clarivate Analytics's Journal Citation Reports, Science Edition (2017). JOSPT stands eighth of 65 journals in the category of rehabilitation, twelfth of 77 journals in orthopedics, and fourteenth of 81 journals in sport sciences. JOSPT's 5-year impact factor is 4.061.
Latest Articles in This Journal
Concussion Incidence by Type of Sport: Differences by Sex, Age Groups, Type of Session, and Level of Play: An Overview of Systematic Reviews With Meta-analysis.
Differential Effects of Quadriceps and Hip Muscle Exercises for Patellofemoral Pain: A Secondary Effect Modifier Analysis of a Randomized Trial.
Improvements in Forward Bending Are Related to Improvements in Pain and Disability During Cognitive Functional Therapy for People With Chronic Low Back Pain.
The Influence of "Labels" for Neck Pain on Recovery Expectations Following a Motor Vehicle Crash: An Online-Randomized Vignette-Based Experiment.
Encouraging New Moms to Move More-Are We Missing the Mark? A Systematic Review With Meta-Analysis of the Effect of Exercise Interventions on Postpartum Physical Activity Levels and Cardiorespiratory Fitness.