PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation.

Journal of Nursing Scholarship | IF 2.4 | Zone 3, Medicine | JCR Q1, Nursing | Pub Date: 2024-11-24 | DOI: 10.1111/jnu.13036

Lucija Gosak, Gregor Štiglic, Lisiane Pruinelli, Dominika Vrbnjak


Citations: 0

Abstract


PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation.

Aim: The aim of this study was to evaluate and compare artificial intelligence (AI)-based large language models (LLMs) (ChatGPT-3.5, Bing, and Bard) with human-based formulations in generating relevant clinical queries, using comprehensive methodological evaluations.

Methods: To interact with the major LLMs ChatGPT-3.5, Bing Chat, and Google Bard, scripts and prompts were designed to formulate PICOT (population, intervention, comparison, outcome, time) clinical questions and search strategies. The quality of the LLMs' responses was assessed using a descriptive approach and independent assessment by two researchers. To determine the number of hits, the search strings generated by the three LLMs, plus an additional one written by the expert, were run separately in PubMed, Web of Science, Cochrane Library, and CINAHL Ultimate without search restrictions, and the results were imported. Hits from one of the scenarios were also exported for relevance evaluation. A single scenario was chosen to provide a focused analysis. Cronbach's alpha and the intraclass correlation coefficient (ICC) were also calculated.
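As a rough illustration of what a PICOT-derived search strategy looks like, the sketch below joins synonyms within each PICOT element with OR and joins the elements with AND. The `Picot` class, the helper names, and the example terms are hypothetical; the abstract does not give the study's actual prompts or search strings.

```python
from dataclasses import dataclass, field


@dataclass
class Picot:
    """Hypothetical container for the search terms of one PICOT question."""
    population: list[str]
    intervention: list[str]
    comparison: list[str] = field(default_factory=list)
    outcome: list[str] = field(default_factory=list)
    time: list[str] = field(default_factory=list)


def or_group(terms: list[str]) -> str:
    """Join synonyms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_search_string(q: Picot) -> str:
    """Join the non-empty PICOT elements with AND."""
    groups = [q.population, q.intervention, q.comparison, q.outcome, q.time]
    return " AND ".join(or_group(g) for g in groups if g)


# Illustrative terms only; not one of the study's five scenarios.
example = Picot(
    population=["adults", "type 2 diabetes"],
    intervention=["mobile application", "mHealth"],
    outcome=["glycemic control", "HbA1c"],
)
query = build_search_string(example)
print(query)
```

Empty elements (here, comparison and time) are skipped, which mirrors the common practice of searching only on the elements that narrow the result set usefully.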

Results: In five different scenarios, ChatGPT-3.5 generated 11,859 hits, Bing 1,376,854, Bard 16,583, and the expert 5,919. We then used the first scenario to assess the relevance of the obtained results. The human expert search approach resulted in 65.22% (56/105) relevant articles. Bing was the most accurate AI-based LLM with 70.79% (63/89), followed by ChatGPT-3.5 with 21.05% (12/45), and Bard with 13.29% (42/316) relevant hits. Based on the assessment of two evaluators, ChatGPT-3.5 received the highest score (M = 48.50; SD = 0.71). Results showed a high level of agreement between the two evaluators. Although ChatGPT-3.5 showed a lower percentage of relevant hits compared to Bing, this reflects the nuanced evaluation criteria, where the subjective evaluation prioritized contextual accuracy and quality over mere relevance.
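To make the reported score summary concrete, the sketch below shows how a mean and standard deviation of this form arise from two evaluators' totals. It assumes the SD is the sample standard deviation across the two evaluators; the individual scores 48 and 49 are not stated in the abstract and are an illustrative assumption that happens to reproduce M = 48.50 and SD = 0.71.

```python
from statistics import mean, stdev

# Hypothetical total scores from the two evaluators (not given in the abstract).
scores = [48, 49]

m = mean(scores)    # arithmetic mean of the two totals
sd = stdev(scores)  # sample standard deviation (n - 1 in the denominator)

print(f"M = {m:.2f}; SD = {sd:.2f}")
```

With only two raters, the sample SD equals the absolute difference between their scores divided by the square root of 2, so a small SD indicates that the two totals were close.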

Conclusion: This study provides valuable insights into the ability of LLMs to formulate PICOT clinical questions and search strategies. AI-based LLMs, such as ChatGPT-3.5, demonstrate significant potential for augmenting clinical workflows, improving clinical query development, and supporting search strategies. However, the findings also highlight limitations that necessitate further refinement and continued human oversight.

Clinical relevance: AI could assist nurses in formulating PICOT clinical questions and search strategies. AI-based LLMs offer valuable support to healthcare professionals by improving the structure of clinical questions and enhancing search strategies, thereby significantly increasing the efficiency of information retrieval.


Source journal

Journal of Nursing Scholarship (Medicine: Nursing)

CiteScore: 6.30
Self-citation rate: 5.90%
Articles published: (not listed)
Review time: 6-12 weeks

About the journal: This widely read and respected journal features peer-reviewed, thought-provoking articles representing research by some of the world’s leading nurse researchers. Reaching health professionals, faculty and students in 103 countries, the Journal of Nursing Scholarship is focused on the health of people throughout the world. It is the official journal of Sigma Theta Tau International and it reflects the society’s dedication to providing the tools necessary to improve nursing care around the world.