GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections.

MUSCULOSKELETAL SURGERY (Q1, Medicine) | Pub Date: 2024-12-01 | Epub Date: 2024-07-02 | DOI: 10.1007/s12306-024-00846-w
M Bortoli, M Fiore, S Tedeschi, V Oliveira, R Sousa, A Bruschi, D A Campanacci, P Viale, M De Paolis, A Sambri
Pages: 459-466 | Publication type: Journal Article | URL: https://doi.org/10.1007/s12306-024-00846-w

Abstract

Background: Artificial intelligence chatbot tools might discern patterns and correlations that elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three GPT-based chatbots on questions about prosthetic joint infections (PJI).

Methods: Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed by intraclass correlation statistics.

Results: Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor.

Conclusions: On average, experts rate the quality of responses positively, but individual ratings frequently vary widely. This suggests that AI chatbot tools are currently still unreliable in the management of PJI.
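
The inter-rater reliability step described in the Methods can be illustrated with a small calculation. The sketch below assumes the ICC(2,1) form (two-way random effects, absolute agreement, single rater); the abstract does not state which form the authors used, so this choice is an assumption. The function name `icc2_1` and the ratings table are hypothetical and are not data from the study.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows: one row per rated item (for example a
    chatbot response) and one column per rater.
    """
    n = len(ratings)      # number of rated items
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 items and >= 2 raters")

    grand = mean([x for row in ratings for x in row])
    row_means = [mean(row) for row in ratings]
    col_means = [mean([row[j] for row in ratings]) for j in range(k)]

    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_err) / denom


# Hypothetical 1-5 Likert ratings: 4 responses (rows) scored by 5 raters (columns).
# These numbers are invented for illustration only.
hypothetical_ratings = [
    [4, 5, 3, 4, 2],
    [5, 4, 4, 3, 3],
    [3, 5, 2, 4, 4],
    [4, 3, 5, 5, 2],
]
print(round(icc2_1(hypothetical_ratings), 2))
```

Values near 1 indicate that raters score the same response similarly; values near 0 (or below) indicate that disagreement between raters is as large as the differences between responses, which is the situation the Results describe as very poor agreement.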

Source journal: MUSCULOSKELETAL SURGERY (Medicine-Surgery)
CiteScore: 4.50
Self-citation rate: 0.00%
Articles published: 35
About the journal: Musculoskeletal Surgery, formerly La Chirurgia degli Organi di Movimento, founded in 1917 at the Istituto Ortopedico Rizzoli, is a peer-reviewed journal published three times a year. The journal provides up-to-date information to clinicians and scientists through the publication of original papers, reviews, case reports, and brief communications dealing with the pathogenesis and treatment of orthopaedic conditions. An electronic version is also available at http://www.springerlink.com. The journal is open for publication of supplements and for publishing abstracts of scientific meetings; conditions can be obtained from the Editors-in-Chief or the Publisher.