Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience
Marina Alvarez-Estape, Ivan Cano, Rosa Pino, Carla González Grado, Andrea Aldemira-Liz, Javier Gonzálvez-Ortuño, Juanjo do Olmo, Javier Logroño, Marcelo Martínez, Carlos Mascías, Julián Isla, Jordi Martínez Roldán, Cristian Launes, Francesc Garcia-Cuyas, Paula Esteller-Cucala
medRxiv - Health Informatics, published 2024-07-26. DOI: 10.1101/2024.07.23.24310847 (https://doi.org/10.1101/2024.07.23.24310847)
Abstract
Importance: The time to accurately diagnose rare pediatric diseases often spans years. Assessing the diagnostic accuracy of an LLM-based tool on real pediatric cases can help reduce this time, providing quicker diagnoses for patients and their families.

Objective: To evaluate the clinical utility of DxGPT as a support tool for differential diagnosis of both common and rare diseases.

Design: Single-center, descriptive, cross-sectional exploratory study. Anonymized data from the medical histories of 50 pediatric patients, covering common and rare pathologies, were used to generate clinical case notes. Each clinical case included essential data, and some were expanded with the results of complementary tests.

Setting: This study was conducted at a reference pediatric hospital, Sant Joan de Déu Barcelona Children's Hospital.

Participants: A total of 50 clinical cases were diagnosed by 78 volunteer doctors (the medical diagnostic team) with varying levels of experience, each reviewing 3 clinical cases.

Interventions: Each clinician listed up to five diagnoses per clinical case note. The same notes were submitted to the DxGPT web platform to obtain its Top-5 diagnostic proposals. To evaluate DxGPT's variability, each note was queried three times.

Main Outcomes and Measures: The primary outcome was diagnostic accuracy, defined as the percentage of cases with the correct diagnosis, compared between the medical diagnostic team and DxGPT. Other evaluation criteria included qualitative assessments. The medical diagnostic team also completed a survey on their user experience with DxGPT.

Results: Top-5 diagnostic accuracy was 65% for clinicians and 60% for DxGPT, with no significant difference. Accuracy was higher for common diseases (clinicians: 79%, DxGPT: 71%) than for rare diseases (clinicians: 50%, DxGPT: 49%). Accuracy increased similarly in both groups when expanded information was provided, but the increase was statistically significant only for clinicians (simple 52% vs. expanded 69%; p=0.03). DxGPT's response variability affected less than 5% of clinical case notes. In a survey completed by 48 clinicians, the DxGPT platform was rated 3.9/5 overall, 4.1/5 for usefulness, and 4.5/5 for usability.

Conclusions and Relevance: DxGPT showed diagnostic accuracy similar to that of medical staff from a pediatric hospital, indicating its potential for supporting differential diagnosis in other settings. Clinicians praised its usability and simplicity. These tools could provide new insights for challenging diagnostic cases.
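
The accuracy measure defined above (the percentage of cases whose correct diagnosis appears among up to five proposals) can be illustrated with a minimal Python sketch. The function name `top_k_accuracy`, the toy case data, and the use of case-insensitive string matching are all assumptions made for illustration; the abstract does not state how a proposal was judged to match the correct diagnosis, and none of the values below come from the study.

```python
from typing import Sequence


def top_k_accuracy(
    proposals: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of cases whose correct diagnosis is among the first k proposals.

    Matching is a plain case-insensitive string comparison, which is a
    simplification assumed for this sketch only.
    """
    if len(proposals) != len(truths):
        raise ValueError("proposals and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")

    hits = 0
    for ranked, truth in zip(proposals, truths):
        # Normalize the first k proposals and the reference diagnosis.
        top_k = {p.strip().lower() for p in ranked[:k]}
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(truths)


# Hypothetical toy data (not taken from the study): four cases,
# each with a ranked list of proposed diagnoses.
toy_proposals = [
    ["asthma", "bronchiolitis", "pneumonia"],
    ["migraine", "tension headache"],
    ["celiac disease", "lactose intolerance", "giardiasis", "Crohn disease"],
    ["viral gastroenteritis", "appendicitis"],
]
toy_truths = ["bronchiolitis", "epilepsy", "Crohn disease", "appendicitis"]

print(top_k_accuracy(toy_proposals, toy_truths))       # 0.75 (3 of 4 cases)
print(top_k_accuracy(toy_proposals, toy_truths, k=1))  # 0.0 (no first-ranked hit)
```

Applying the same function to the clinicians' lists and to the tool's lists would yield two percentages that can be compared directly, as in the Results.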