Let's Have a Chat: How Well Does an Artificial Intelligence Chatbot Answer Clinical Infectious Diseases Pharmacotherapy Questions?
Wesley D Kufel, Kathleen D Hanrahan, Robert W Seabury, Katie A Parsels, Jason C Gallagher, Conan MacDougall, Elizabeth W Covington, Elias B Chahine, Rachel S Britt, Jeffrey M Steele
Open Forum Infectious Diseases, volume 11, issue 11, ofae641. Published 2024-10-25. DOI: https://doi.org/10.1093/ofid/ofae641
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11551448/pdf/
Journal impact factor: 3.8 (JCR Q2, Immunology)
Citations: 0
Abstract
Background: It is unknown whether ChatGPT provides quality responses to infectious diseases (ID) pharmacotherapy questions. This study surveyed ID pharmacist subject matter experts (SMEs) to assess the quality of ChatGPT version 3.5 (GPT-3.5) responses.
Methods: The primary outcome was the percentage of GPT-3.5 responses considered useful by SME rating. Secondary outcomes were SMEs' ratings of correctness, completeness, and safety. Rating definitions were based on literature review. One hundred ID pharmacotherapy questions were entered into GPT-3.5 without custom instructions or additional prompts, and responses were recorded. A 0-10 rating scale for correctness, completeness, and safety was developed and validated for interrater reliability. Continuous and categorical variables were assessed for interrater reliability via the average-measures intraclass correlation coefficient and Fleiss multirater kappa, respectively. SMEs' responses were compared by the Kruskal-Wallis test and the chi-square test for continuous and categorical variables, respectively.
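To make the categorical agreement statistic concrete, the following is a minimal sketch of Fleiss' multirater kappa using the standard textbook formula. It is not the authors' analysis code, and the abstract does not state which software was used. The function name `fleiss_kappa` and the toy rating counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    counts[i][j] is the number of raters who placed subject i (here, a
    chatbot response) in category j (here, "useful" or "not useful").
    Every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    n_categories = len(counts[0])
    total_ratings = n_subjects * n_raters

    # Share of all ratings that fell into each category.
    category_share = [
        sum(row[j] for row in counts) / total_ratings
        for j in range(n_categories)
    ]

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_subject) / n_subjects

    # Agreement expected by chance from the category shares alone.
    expected = sum(p * p for p in category_share)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 4 responses, 3 raters, columns = [useful, not useful].
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(unanimous))
print(fleiss_kappa(split))
```

In the toy data, unanimous ratings give a kappa of 1, while the 2-to-1 splits give a negative value, meaning the raters agreed less often than chance alone would predict.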
Results: SMEs considered 41.8% of responses useful. Median (IQR) ratings for correctness, completeness, and safety were 7 (4-9), 5 (3-8), and 8 (4-10), respectively. The Fleiss multirater kappa for usefulness was 0.379 (95% CI, .317-.441), indicating fair agreement, and intraclass correlation coefficients were 0.820 (95% CI, .758-.870), 0.745 (95% CI, .656-.816), and 0.833 (95% CI, .775-.880) for correctness, completeness, and safety, respectively, indicating at least substantial agreement. No significant difference was observed among SME responses for percentage of responses considered useful.
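The agreement labels in the results ("fair", "at least substantial") match the widely used Landis and Koch bands for kappa-type statistics. The abstract does not name the scale it used, so applying these bands, and applying them to the intraclass correlation coefficients as well, is an assumption here; other conventions for interpreting ICC values exist. The helper name `landis_koch_label` is hypothetical.

```python
def landis_koch_label(value: float) -> str:
    """Map an agreement coefficient to its Landis and Koch descriptor.

    Assumption: the bands below are the conventional Landis-Koch cut points;
    the abstract does not state which scale the authors applied.
    """
    if value < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if value <= upper_bound:
            return label
    raise ValueError("agreement coefficients cannot exceed 1")


# Point estimates reported in the abstract.
reported = {
    "usefulness (Fleiss kappa)": 0.379,
    "correctness (ICC)": 0.820,
    "completeness (ICC)": 0.745,
    "safety (ICC)": 0.833,
}
for outcome, estimate in reported.items():
    print(f"{outcome}: {estimate} -> {landis_koch_label(estimate)}")
```

Under this assumed scale, the usefulness kappa falls in the "fair" band, the completeness ICC in "substantial", and the correctness and safety ICCs in "almost perfect", which is consistent with the abstract's wording.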
Conclusions: Fewer than 50% of GPT-3.5 responses were considered useful by SMEs. Responses were mostly considered correct and safe but were often incomplete, suggesting that GPT-3.5 responses may not replace an ID pharmacist's responses.
About the journal:
Open Forum Infectious Diseases provides a global forum for the publication of clinical, translational, and basic research findings in a fully open access, online journal environment. The journal reflects the broad diversity of the field of infectious diseases, and focuses on the intersection of biomedical science and clinical practice, with a particular emphasis on knowledge that holds the potential to improve patient care in populations around the world. Fully peer-reviewed, OFID supports the international community of infectious diseases experts by providing a venue for articles that further the understanding of all aspects of infectious diseases.