M Bortoli, M Fiore, S Tedeschi, V Oliveira, R Sousa, A Bruschi, D A Campanacci, P Viale, M De Paolis, A Sambri
GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections.
Musculoskeletal Surgery, pp. 459-466. Published 2024-12-01 (Epub 2024-07-02). DOI: 10.1007/s12306-024-00846-w (https://doi.org/10.1007/s12306-024-00846-w)
Citations: 0
Abstract
Background: Artificial intelligence chatbot tools might discern patterns and correlations that elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three GPT-based chatbots on questions about prosthetic joint infections (PJI).
Methods: Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians on a five-point Likert-like scale with numerical values to quantify response quality. Inter-rater reliability was assessed by intraclass correlation statistics.
Results: Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor.
Conclusions: On average, experts rate the quality of responses positively, but the ratings frequently vary widely between raters. This suggests that AI chatbot tools are currently still unreliable in the management of PJI.
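The inter-rater reliability check mentioned in the Methods can be illustrated with a short sketch. The abstract does not say which form of correlation statistic was computed, so the code below assumes one common choice, the two-way random-effects, absolute-agreement, single-rater intraclass correlation, usually written ICC(2,1). The function name `icc2_1` and the `example_ratings` matrix are hypothetical and are not taken from the study.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated response, one column per rater.
    Assumption: this ICC form is chosen for illustration only; the abstract
    does not state which form the authors used.
    """
    n = len(ratings)       # number of rated responses
    k = len(ratings[0])    # number of raters

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(col) for col in zip(*ratings)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical toy matrix: 6 rated responses (rows) by 4 raters (columns).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(round(icc2_1(example_ratings), 2))
```

In this toy matrix the raters rank the responses similarly but use very different parts of the scale, and an absolute-agreement ICC penalizes that, so the value comes out low. Under the same assumption, the study's data would be a matrix of rated responses by five raters for each chatbot.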
About the journal:
Musculoskeletal Surgery, formerly La Chirurgia degli Organi di Movimento, founded in 1917 at the Istituto Ortopedico Rizzoli, is a peer-reviewed journal published three times a year. The journal provides up-to-date information to clinicians and scientists through the publication of original papers, reviews, case reports, and brief communications dealing with the pathogenesis and treatment of orthopaedic conditions. An electronic version is also available at http://www.springerlink.com. The journal is open to publishing supplements and abstracts of scientific meetings; conditions can be obtained from the Editors-in-Chief or the Publisher.