Wesley D Kufel, Kathleen D Hanrahan, Robert W Seabury, Katie A Parsels, Jason C Gallagher, Conan MacDougall, Elizabeth W Covington, Elias B Chahine, Rachel S Britt, Jeffrey M Steele
Open Forum Infectious Diseases, volume 11, issue 11, article ofae641. Published 2024-10-25. DOI: https://doi.org/10.1093/ofid/ofae641. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11551448/pdf/
Citations: 0
Abstract
Let's Have a Chat: How Well Does an Artificial Intelligence Chatbot Answer Clinical Infectious Diseases Pharmacotherapy Questions?
Background: It is unknown whether ChatGPT provides quality responses to infectious diseases (ID) pharmacotherapy questions. This study surveyed ID pharmacist subject matter experts (SMEs) to assess the quality of ChatGPT version 3.5 (GPT-3.5) responses.
Methods: The primary outcome was the percentage of GPT-3.5 responses considered useful by SME rating. Secondary outcomes were SMEs' ratings of correctness, completeness, and safety. Rating definitions were based on literature review. One hundred ID pharmacotherapy questions were entered into GPT-3.5 without custom instructions or additional prompts, and responses were recorded. A 0-10 rating scale for correctness, completeness, and safety was developed and validated for interrater reliability. Continuous and categorical variables were assessed for interrater reliability via the average-measures intraclass correlation coefficient and Fleiss multirater kappa, respectively. SMEs' responses were compared by the Kruskal-Wallis test for continuous variables and the chi-square test for categorical variables.
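To make the categorical reliability measure concrete, the following is a minimal sketch of Fleiss' multirater kappa in plain Python. It is not the authors' analysis code: the function name `fleiss_kappa`, the count-table layout, and the toy ratings are all assumptions made for illustration, and the study's actual data are not reproduced here.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (hypothetical layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, computed per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (invented): 4 responses, 3 raters, columns = [useful, not useful].
toy_ratings = [[2, 1], [1, 2], [3, 0], [0, 3]]
print(round(fleiss_kappa(toy_ratings), 3))
```

The table form (items by categories) is used because Fleiss' kappa depends only on how many raters chose each category per item, not on which rater chose it.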
Results: SMEs considered 41.8% of responses useful. Median (IQR) ratings for correctness, completeness, and safety were 7 (4-9), 5 (3-8), and 8 (4-10), respectively. The Fleiss multirater kappa for usefulness was 0.379 (95% CI, .317-.441), indicating fair agreement, and intraclass correlation coefficients were 0.820 (95% CI, .758-.870), 0.745 (95% CI, .656-.816), and 0.833 (95% CI, .775-.880) for correctness, completeness, and safety, respectively, indicating at least substantial agreement. No significant difference was observed among SME responses for percentage of responses considered useful.
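The agreement labels above ("fair", "at least substantial") match the widely used Landis and Koch benchmark bands, although the abstract does not state which scale the authors applied. The sketch below assumes that scale; the function name `landis_koch_label` is hypothetical, and applying kappa benchmarks to intraclass correlation coefficients is itself an assumption, since other ICC-specific scales exist.

```python
def landis_koch_label(value: float) -> str:
    """Map an agreement coefficient to a Landis and Koch descriptive band.
    Assumes the conventional cut points 0.20 / 0.40 / 0.60 / 0.80."""
    if value > 1.0:
        raise ValueError("agreement coefficients cannot exceed 1")
    if value < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    raise AssertionError("unreachable: value is within [0, 1]")


# Point estimates reported in the Results paragraph.
reported = {
    "usefulness (Fleiss kappa)": 0.379,
    "correctness (ICC)": 0.820,
    "completeness (ICC)": 0.745,
    "safety (ICC)": 0.833,
}
for name, estimate in reported.items():
    print(f"{name}: {estimate:.3f} -> {landis_koch_label(estimate)}")
```

Under this assumed scale, the usefulness kappa falls in the "fair" band and all three ICC estimates fall in "substantial" or higher, which is consistent with the wording of the Results.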
Conclusions: Fewer than 50% of GPT-3.5 responses were considered useful by SMEs. Responses were mostly considered correct and safe but were often incomplete, suggesting that GPT-3.5 responses may not replace an ID pharmacist's responses.
About the journal:
Open Forum Infectious Diseases provides a global forum for the publication of clinical, translational, and basic research findings in a fully open access, online journal environment. The journal reflects the broad diversity of the field of infectious diseases, and focuses on the intersection of biomedical science and clinical practice, with a particular emphasis on knowledge that holds the potential to improve patient care in populations around the world. Fully peer-reviewed, OFID supports the international community of infectious diseases experts by providing a venue for articles that further the understanding of all aspects of infectious diseases.