Diagnostic performance of ChatGPT in tibial plateau fracture in knee X-ray
Mohammadreza Mohammadi, Sara Parviz, Parinaz Parvaz, Mohammad Mahdi Pirmoradi, Mohammad Afzalimoghaddam, Hadi Mirfazaelian
Emergency Radiology, pages 59-64. Published 2025-02-01 (Epub 2024-11-30). DOI: 10.1007/s10140-024-02298-y
Citations: 0
Abstract
Purpose: Tibial plateau fractures are relatively common and require accurate diagnosis. Chat Generative Pre-Trained Transformer (ChatGPT) has emerged as a tool to improve medical diagnosis. This study aims to investigate the accuracy of this tool in diagnosing tibial plateau fractures.
Methods: A secondary analysis was performed on 111 knee radiographs from emergency department patients, 29 of which had fractures confirmed by computed tomography (CT) imaging. The X-rays were reviewed by a board-certified emergency physician (EP) and a board-certified radiologist, and then analyzed by ChatGPT-4 and ChatGPT-4o. Diagnostic performance was compared using the area under the receiver operating characteristic curve (AUC). Sensitivity, specificity, and likelihood ratios were also calculated.
Results: The sensitivity and negative likelihood ratio were 58.6% (95% CI: 38.9-76.4%) and 0.4 (95% CI: 0.3-0.7) for the EP, 72.4% (95% CI: 52.7-87.2%) and 0.3 (95% CI: 0.2-0.6) for the radiologist, 27.5% (95% CI: 12.7-47.2%) and 0.7 (95% CI: 0.6-0.9) for ChatGPT-4, and 55.1% (95% CI: 35.6-73.5%) and 0.4 (95% CI: 0.3-0.7) for ChatGPT-4o. The specificity and positive likelihood ratio were 85.3% (95% CI: 75.8-92.2%) and 4.0 (95% CI: 2.1-7.3) for the EP, 76.8% (95% CI: 66.2-85.4%) and 3.1 (95% CI: 1.9-4.9) for the radiologist, 95.1% (95% CI: 87.9-98.6%) and 5.6 (95% CI: 1.8-17.3) for ChatGPT-4, and 93.9% (95% CI: 86.3-97.9%) and 9.0 (95% CI: 3.6-22.4) for ChatGPT-4o. The AUC was 0.72 (95% CI: 0.6-0.8) for the EP, 0.75 (95% CI: 0.6-0.8) for the radiologist, 0.61 (95% CI: 0.4-0.7) for ChatGPT-4, and 0.74 (95% CI: 0.6-0.8) for ChatGPT-4o. The EP and radiologist significantly outperformed ChatGPT-4 (P = 0.02 and P = 0.01, respectively), whereas there was no significant difference among the EP, the radiologist, and ChatGPT-4o.
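The reported metrics all derive from a standard 2x2 contingency table (test result vs. CT-confirmed fracture status). As an illustrative sketch, the ChatGPT-4o figures can be approximately reproduced; note the per-model true/false positive counts below are back-calculated from the reported percentages and the study's 29 fracture / 82 non-fracture cases, not taken directly from the paper:

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard diagnostic accuracy metrics from a 2x2 table."""
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    lr_pos = sensitivity / (1 - specificity)    # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity    # negative likelihood ratio
    return {"sensitivity": sensitivity, "specificity": specificity,
            "LR+": lr_pos, "LR-": lr_neg}

# Assumed counts for ChatGPT-4o: 16 of 29 fractures detected (55.1% sensitivity)
# and 77 of 82 non-fracture radiographs correctly read as negative (93.9% specificity).
metrics = diagnostic_metrics(tp=16, fp=5, fn=13, tn=77)
print(metrics)  # sensitivity ≈ 0.551, specificity ≈ 0.939, LR+ ≈ 9.0
```

With these assumed counts the negative likelihood ratio comes out near 0.48; the abstract's 0.4 presumably reflects the rounding or exact counts used in the original analysis.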
Conclusion: ChatGPT-4o matched the physicians' performance and had the highest specificity. Like the physicians, however, the ChatGPT models were not suitable for ruling out fracture.
Journal description
- To advance and improve the radiologic aspects of emergency care
- To establish Emergency Radiology as an area of special interest in the field of diagnostic imaging
- To improve methods of education in Emergency Radiology
- To provide, through formal meetings, a mechanism for presentation of scientific papers on various aspects of Emergency Radiology and continuing education
- To promote research in Emergency Radiology by clinical and basic science investigators, including residents and other trainees
- To act as the resource body on Emergency Radiology for those interested in emergency patient care

Members of the American Society of Emergency Radiology (ASER) receive the Emergency Radiology journal as a benefit of membership.