
Latest Publications in JMIR AI

AI Awareness and Tobacco Policy Messaging Among US Adults: Electronic Experimental Study.
IF 2.0 | Pub Date: 2025-10-27 | DOI: 10.2196/72987
Julia Mary Alber, David Askay, Anuraj Dhillon, Lauren Sandoval, Sofia Ramos, Katharine Santilena

Background: Despite public health efforts, tobacco use remains the leading cause of preventable death in the United States and continues to disproportionately affect underrepresented populations. Public policies are needed to improve health equity in tobacco-related health outcomes. One strategy for promoting public support for these policies is through health messaging. Improvements in artificial intelligence (AI) technology offer new opportunities to create tailored policy messages quickly; however, there is limited research on how the public might perceive the use of AI for public health messages.

Objective: This study aimed to examine how knowledge of AI use impacts perceptions of a tobacco control policy video.

Methods: A national sample of US adults (N=500) was shown the same AI-generated video that focused on a tobacco control policy. Participants were then randomly assigned to 1 of 4 conditions where they were (1) told the narrator of the video was AI, (2) told the narrator of the video was human, (3) told it was unknown whether the narrator was AI or human, or (4) not provided any information about the narrator.

Results: Perceived video rating, effectiveness, and credibility did not significantly differ among the conditions. However, the mean speaker rating was significantly higher (P=.001) when participants were told the narrator of the health message was human (mean 3.65, SD 0.91) compared to the other conditions. Notably, positive attitudes toward AI were highest among those not provided information about the narrator; however, this difference was not statistically significant (mean 3.04, SD 0.90).
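The abstract does not name the test behind P=.001; one plausible reading is a one-way ANOVA comparing mean speaker ratings across the 4 narrator-information conditions. A minimal sketch with simulated placeholder ratings, not study data:

```python
# One-way ANOVA across the 4 narrator-information conditions; the choice of
# test and all numbers below are illustrative assumptions, not the study's
# analysis or data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated 1-5 speaker ratings, n=125 per condition (N=500 total).
means = {"told_ai": 3.40, "told_human": 3.65, "told_unknown": 3.40, "no_info": 3.40}
ratings = {c: np.clip(rng.normal(m, 0.91, 125), 1, 5) for c, m in means.items()}

f_stat, p_value = stats.f_oneway(*ratings.values())
print(f"one-way ANOVA: F={f_stat:.2f}, P={p_value:.3f}")
```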

Conclusions: Results suggest that AI may impact perceptions of the speaker of a video; however, more research is needed to understand if these impacts would occur over time and after multiple exposures to content. Further qualitative research may help explain why potential differences may have occurred in speaker ratings. Public health professionals and researchers should further consider the ethics and cost-effectiveness of using AI for health messaging.

{"title":"AI Awareness and Tobacco Policy Messaging Among US Adults: Electronic Experimental Study.","authors":"Julia Mary Alber, David Askay, Anuraj Dhillon, Lauren Sandoval, Sofia Ramos, Katharine Santilena","doi":"10.2196/72987","DOIUrl":"10.2196/72987","url":null,"abstract":"<p><strong>Background: </strong>Despite public health efforts, tobacco use remains the leading cause of preventable death in the United States and continues to disproportionately affect underrepresented populations. Public policies are needed to improve health equity in tobacco-related health outcomes. One strategy for promoting public support for these policies is through health messaging. Improvements in artificial intelligence (AI) technology offer new opportunities to create tailored policy messages quickly; however, there is limited research on how the public might perceive the use of AI for public health messages.</p><p><strong>Objective: </strong>This study aimed to examine how knowledge of AI use impacts perceptions of a tobacco control policy video.</p><p><strong>Methods: </strong>A national sample of US adults (N=500) was shown the same AI-generated video that focused on a tobacco control policy. Participants were then randomly assigned to 1 of 4 conditions where they were (1) told the narrator of the video was AI, (2) told the narrator of the video was human, (3) told it was unknown whether the narrator was AI or human, or (4) not provided any information about the narrator.</p><p><strong>Results: </strong>Perceived video rating, effectiveness, and credibility did not significantly differ among the conditions. However, the mean speaker rating was significantly higher (P=.001) when participants were told the narrator of the health message was human (mean 3.65, SD 0.91) compared to the other conditions. Notably, positive attitudes toward AI were highest among those not provided information about the narrator; however, this difference was not statistically significant (mean 3.04, SD 0.90).</p><p><strong>Conclusions: </strong>Results suggest that AI may impact perceptions of the speaker of a video; however, more research is needed to understand if these impacts would occur over time and after multiple exposures to content. Further qualitative research may help explain why potential differences may have occurred in speaker ratings. Public health professionals and researchers should further consider the ethics and cost-effectiveness of using AI for health messaging.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72987"},"PeriodicalIF":2.0,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12558419/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145380050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study.
IF 2.0 | Pub Date: 2025-10-23 | DOI: 10.2196/75262
David Baur, Jörg Ansorg, Christoph-Eckhard Heyde, Anna Voelker
<p><strong>Background: </strong>Large language models are increasingly applied in health care for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and a lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, particularly in non-English settings.</p><p><strong>Objective: </strong>This study aimed to develop and evaluate a RAG-based chatbot that provides German-language, evidence-based information on common orthopedic conditions. We assessed the system's performance in terms of response accuracy, contextual precision, and alignment with retrieved sources. In addition, we examined user satisfaction, usability, and perceived trustworthiness.</p><p><strong>Methods: </strong>The chatbot integrated OpenAI's GPT language model with a Qdrant vector database for semantic search. Its corpus consisted of 899 curated German-language documents, including national orthopedic guidelines and patient education content from the Orthinform platform of the German Society of Orthopedics and Trauma Surgery. After preprocessing, the data were segmented into 18,197 retrievable chunks. Evaluation occurred in two phases: (1) human validation by 30 participants (orthopedic specialists, medical students, and nonmedical users), who rated 12 standardized chatbot responses using a 5-point Likert scale, and (2) automated evaluation of 100 synthetic queries using the Retrieval-Augmented Generation Assessment Scale, measuring answer relevancy, contextual precision, and faithfulness. A permanent disclaimer indicated that the chatbot provides general information only and is not intended for diagnosis or treatment decisions.</p><p><strong>Results: </strong>Human ratings indicated high perceived quality for accuracy (mean 4.55, SD 0.45), helpfulness (mean 4.61, SD 0.57), ease of use (mean 4.90, SD 0.30), and clarity (mean 4.77, SD 0.43), while trust scored slightly lower (mean 4.23, SD 0.56). Retrieval-Augmented Generation Assessment Scale evaluation confirmed strong technical performance for answer relevancy (mean 0.864, SD 0.223), contextual precision (mean 0.891, SD 0.201), and faithfulness (mean 0.853, SD 0.171). Performance was highest for knee and back-related topics and lower for hip-related queries (eg, gluteal tendinopathy), which showed elevated error rates in differential diagnosis.</p><p><strong>Conclusions: </strong>The chatbot demonstrated strong performance in delivering orthopedic patient education through an RAG framework. Its deployment on the national Orthinform platform has led to more than 9500 real-world user interactions, supporting its relevance and acceptance. Future improvements should focus on expanding domain coverage, enhancing retrieval pre
{"title":"Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study.","authors":"David Baur, Jörg Ansorg, Christoph-Eckhard Heyde, Anna Voelker","doi":"10.2196/75262","DOIUrl":"10.2196/75262","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Large language models are increasingly applied in health care for documentation, patient education, and clinical decision support. However, their factual reliability can be compromised by hallucinations and a lack of source traceability. Retrieval-augmented generation (RAG) enhances response accuracy by combining generative models with document retrieval mechanisms. While promising in medical contexts, RAG-based systems remain underexplored in orthopedic and trauma surgery patient education, particularly in non-English settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This study aimed to develop and evaluate a RAG-based chatbot that provides German-language, evidence-based information on common orthopedic conditions. We assessed the system's performance in terms of response accuracy, contextual precision, and alignment with retrieved sources. In addition, we examined user satisfaction, usability, and perceived trustworthiness.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;The chatbot integrated OpenAI's GPT language model with a Qdrant vector database for semantic search. Its corpus consisted of 899 curated German-language documents, including national orthopedic guidelines and patient education content from the Orthinform platform of the German Society of Orthopedics and Trauma Surgery. After preprocessing, the data were segmented into 18,197 retrievable chunks. Evaluation occurred in two phases: (1) human validation by 30 participants (orthopedic specialists, medical students, and nonmedical users), who rated 12 standardized chatbot responses using a 5-point Likert scale, and (2) automated evaluation of 100 synthetic queries using the Retrieval-Augmented Generation Assessment Scale, measuring answer relevancy, contextual precision, and faithfulness. A permanent disclaimer indicated that the chatbot provides general information only and is not intended for diagnosis or treatment decisions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Human ratings indicated high perceived quality for accuracy (mean 4.55, SD 0.45), helpfulness (mean 4.61, SD 0.57), ease of use (mean 4.90, SD 0.30), and clarity (mean 4.77, SD 0.43), while trust scored slightly lower (mean 4.23, SD 0.56). Retrieval-Augmented Generation Assessment Scale evaluation confirmed strong technical performance for answer relevancy (mean 0.864, SD 0.223), contextual precision (mean 0.891, SD 0.201), and faithfulness (mean 0.853, SD 0.171). Performance was highest for knee and back-related topics and lower for hip-related queries (eg, gluteal tendinopathy), which showed elevated error rates in differential diagnosis.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;The chatbot demonstrated strong performance in delivering orthopedic patient education through an RAG framework. Its deployment on the national Orthinform platform has led to more than 9500 real-world user interactions, supporting its relevance and acceptance. 
Future improvements should focus on expanding domain coverage, enhancing retrieval pre","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75262"},"PeriodicalIF":2.0,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12551339/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145356933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation.
IF 2.0 | Pub Date: 2025-10-21 | DOI: 10.2196/75030
Kaiying Lin, Abdur Rasool, Saimourya Surabhi, Cezmi Mutlu, Haopeng Zhang, Dennis P Wall, Peter Washington
<p><strong>Background: </strong>Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains under-studied.</p><p><strong>Objective: </strong>This study aimed to evaluate whether incorporating structured clinical assessment scales improved the diagnostic performance of LLM-based chatbots for neuropsychiatric conditions (we evaluated autism spectrum disorder, aphasia, and depression datasets) across two prompting strategies: (1) direct diagnosis and (2) code generation. We aimed to contextualize LLM-based diagnostic performance by benchmarking it against prior work that applied traditional machine learning classifiers to the same datasets, allowing us to assess whether LLMs offer competitive or complementary capabilities in clinical classification tasks.</p><p><strong>Methods: </strong>We tested two approaches using ChatGPT, Gemini, and Claude models: (1) direct diagnostic querying and (2) execution of chatbot-generated code for classification. Three diagnostic datasets were used: ASDBank (autism spectrum disorder), AphasiaBank (aphasia), and Distress Analysis Interview Corpus-Wizard-of-Oz interviews (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.</p><p><strong>Results: </strong>Across all 3 datasets, incorporating clinical assessment scales led to little improvement in performance, and results remained inconsistent and generally below those reported in previous studies. On the AphasiaBank dataset, the direct diagnosis approach using ChatGPT with GPT-4 produced a low F<sub>1</sub>-score of 65.6% and specificity of 33%. The code generation method improved results, with ChatGPT with GPT-4o reaching an F<sub>1</sub>-score of 81.4%, specificity of 78.6%, and sensitivity of 84.3%. ChatGPT with GPT-o3 and Gemini 2.5 Pro performed even better, with F<sub>1</sub>-scores of 86.5% and 84.3%, respectively. For the ASDBank dataset, direct diagnosis results were lower, with F<sub>1</sub>-scores of 56% for ChatGPT with GPT-4 and 54% for ChatGPT with GPT-4o. Under code generation, ChatGPT with GPT-o3 reached 67.9%, and Claude 3.5 performed reasonably well with 60%. Gemini 2.5 Pro failed to respond under this assessment condition. In the Distress Analysis Interview Corpus-Wizard-of-Oz dataset, direct diagnosis yielded high accuracy (70.9%) but poor F<sub>1</sub>-scores of 8% using ChatGPT with GPT-4o. Code generation improved specificity-88.6% with ChatGPT with GPT-4o-but F<sub>1</sub>-scores remained low overall. These findings suggest that, while clinical scales may help structure outputs, prompting alone remains insufficient for consistent diagnostic accuracy.</p><p><strong>Conclusions: </strong>Current LLM-based chatbots, when prompted naively, under
{"title":"Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation.","authors":"Kaiying Lin, Abdur Rasool, Saimourya Surabhi, Cezmi Mutlu, Haopeng Zhang, Dennis P Wall, Peter Washington","doi":"10.2196/75030","DOIUrl":"10.2196/75030","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains under-studied.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This study aimed to evaluate whether incorporating structured clinical assessment scales improved the diagnostic performance of LLM-based chatbots for neuropsychiatric conditions (we evaluated autism spectrum disorder, aphasia, and depression datasets) across two prompting strategies: (1) direct diagnosis and (2) code generation. We aimed to contextualize LLM-based diagnostic performance by benchmarking it against prior work that applied traditional machine learning classifiers to the same datasets, allowing us to assess whether LLMs offer competitive or complementary capabilities in clinical classification tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We tested two approaches using ChatGPT, Gemini, and Claude models: (1) direct diagnostic querying and (2) execution of chatbot-generated code for classification. Three diagnostic datasets were used: ASDBank (autism spectrum disorder), AphasiaBank (aphasia), and Distress Analysis Interview Corpus-Wizard-of-Oz interviews (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Across all 3 datasets, incorporating clinical assessment scales led to little improvement in performance, and results remained inconsistent and generally below those reported in previous studies. On the AphasiaBank dataset, the direct diagnosis approach using ChatGPT with GPT-4 produced a low F&lt;sub&gt;1&lt;/sub&gt;-score of 65.6% and specificity of 33%. The code generation method improved results, with ChatGPT with GPT-4o reaching an F&lt;sub&gt;1&lt;/sub&gt;-score of 81.4%, specificity of 78.6%, and sensitivity of 84.3%. ChatGPT with GPT-o3 and Gemini 2.5 Pro performed even better, with F&lt;sub&gt;1&lt;/sub&gt;-scores of 86.5% and 84.3%, respectively. For the ASDBank dataset, direct diagnosis results were lower, with F&lt;sub&gt;1&lt;/sub&gt;-scores of 56% for ChatGPT with GPT-4 and 54% for ChatGPT with GPT-4o. Under code generation, ChatGPT with GPT-o3 reached 67.9%, and Claude 3.5 performed reasonably well with 60%. Gemini 2.5 Pro failed to respond under this assessment condition. In the Distress Analysis Interview Corpus-Wizard-of-Oz dataset, direct diagnosis yielded high accuracy (70.9%) but poor F&lt;sub&gt;1&lt;/sub&gt;-scores of 8% using ChatGPT with GPT-4o. Code generation improved specificity-88.6% with ChatGPT with GPT-4o-but F&lt;sub&gt;1&lt;/sub&gt;-scores remained low overall. 
These findings suggest that, while clinical scales may help structure outputs, prompting alone remains insufficient for consistent diagnostic accuracy.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;Current LLM-based chatbots, when prompted naively, under","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75030"},"PeriodicalIF":2.0,"publicationDate":"2025-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12587012/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145350379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Comparison of Japanese Mpox (Monkeypox) Health Education Materials and Texts Created by Artificial Intelligence: Cross-Sectional Quantitative Content Analysis Study.
IF 2.0 | Pub Date: 2025-10-17 | DOI: 10.2196/70604
Shinya Ito, Emi Furukawa, Tsuyoshi Okuhara, Hiroko Okada, Takahiro Kiuchi
<p><strong>Background: </strong>Mpox (monkeypox) outbreaks since 2022 have emphasized the importance of accessible health education materials. However, many Japanese online resources on mpox are difficult to understand, creating barriers for public health communication. Recent advances in artificial intelligence (AI) such as ChatGPT-4o show promise in generating more comprehensible and actionable health education content.</p><p><strong>Objective: </strong>The aim of this study was to evaluate the comprehensibility, actionability, and readability of Japanese health education materials on mpox compared with texts generated by ChatGPT-4o.</p><p><strong>Methods: </strong>A cross-sectional study was conducted using systematic quantitative content analysis. A total of 119 publicly available Japanese health education materials on mpox were compared with 30 texts generated by ChatGPT-4o. Websites containing videos, social media posts, academic papers, and non-Japanese language content were excluded. For generating ChatGPT-4o texts, we used 3 separate prompts with 3 different keywords. For each keyword, text generation was repeated 10 times, with prompt history deleted each time to prevent previous outputs from influencing subsequent generations and to account for output variability. The Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) was used to assess the understandability and actionability of the generated text, while the Japanese Readability Measurement System (jReadability) was used to evaluate readability. The Journal of the American Medical Association benchmark criteria were applied to evaluate the quality of the materials.</p><p><strong>Results: </strong>A total of 119 Japanese mpox-related health education web pages and 30 ChatGPT-4o-generated texts were analyzed. AI-generated texts significantly outperformed web pages in understandability, with 80% (24/30) scoring ≥70% in PEMAT-P (P<.001). Readability scores for AI texts (mean 2.9, SD 0.4) were also higher than those for web pages (mean 2.4, SD 1.0; P=.009). However, web pages included more visual aids and actionable guidance such as practical instructions, which were largely absent in AI-generated content. Government agencies authored 90 (75.6%) out of 119 web pages, but only 31 (26.1%) included proper attribution. Most web pages (117/119, 98.3%) disclosed sponsorship and ownership.</p><p><strong>Conclusions: </strong>AI-generated texts were easier to understand and read than traditional web-based materials. However, web-based texts provided more visual aids and practical guidance. Combining AI-generated texts with traditional web-based materials may enhance the effectiveness of health education materials and improve accessibility to a broader audience. Further research is needed to explore the integration of AI-generated content into public health communication strategies and policies to optimize information delivery during health crises such as the mpox outbreak
{"title":"Comparison of Japanese Mpox (Monkeypox) Health Education Materials and Texts Created by Artificial Intelligence: Cross-Sectional Quantitative Content Analysis Study.","authors":"Shinya Ito, Emi Furukawa, Tsuyoshi Okuhara, Hiroko Okada, Takahiro Kiuchi","doi":"10.2196/70604","DOIUrl":"10.2196/70604","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Mpox (monkeypox) outbreaks since 2022 have emphasized the importance of accessible health education materials. However, many Japanese online resources on mpox are difficult to understand, creating barriers for public health communication. Recent advances in artificial intelligence (AI) such as ChatGPT-4o show promise in generating more comprehensible and actionable health education content.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;The aim of this study was to evaluate the comprehensibility, actionability, and readability of Japanese health education materials on mpox compared with texts generated by ChatGPT-4o.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;A cross-sectional study was conducted using systematic quantitative content analysis. A total of 119 publicly available Japanese health education materials on mpox were compared with 30 texts generated by ChatGPT-4o. Websites containing videos, social media posts, academic papers, and non-Japanese language content were excluded. For generating ChatGPT-4o texts, we used 3 separate prompts with 3 different keywords. For each keyword, text generation was repeated 10 times, with prompt history deleted each time to prevent previous outputs from influencing subsequent generations and to account for output variability. The Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) was used to assess the understandability and actionability of the generated text, while the Japanese Readability Measurement System (jReadability) was used to evaluate readability. The Journal of the American Medical Association benchmark criteria were applied to evaluate the quality of the materials.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;A total of 119 Japanese mpox-related health education web pages and 30 ChatGPT-4o-generated texts were analyzed. AI-generated texts significantly outperformed web pages in understandability, with 80% (24/30) scoring ≥70% in PEMAT-P (P&lt;.001). Readability scores for AI texts (mean 2.9, SD 0.4) were also higher than those for web pages (mean 2.4, SD 1.0; P=.009). However, web pages included more visual aids and actionable guidance such as practical instructions, which were largely absent in AI-generated content. Government agencies authored 90 (75.6%) out of 119 web pages, but only 31 (26.1%) included proper attribution. Most web pages (117/119, 98.3%) disclosed sponsorship and ownership.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;AI-generated texts were easier to understand and read than traditional web-based materials. However, web-based texts provided more visual aids and practical guidance. Combining AI-generated texts with traditional web-based materials may enhance the effectiveness of health education materials and improve accessibility to a broader audience. 
Further research is needed to explore the integration of AI-generated content into public health communication strategies and policies to optimize information delivery during health crises such as the mpox outbreak","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e70604"},"PeriodicalIF":2.0,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12579291/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145314268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Deep Learning Models to Screen Electronic Health Records for Breast and Colorectal Cancer Progression: Performance Evaluation Study.
IF 2.0 | Pub Date: 2025-10-13 | DOI: 10.2196/63767
Pascal Lambert, Rayyan Khan, Marshall Pitz, Harminder Singh, Helen Chen, Kathleen Decker

Background: Cancer progression is an important outcome in cancer research. However, it is frequently documented only in electronic health records (EHRs) as unstructured text, which requires lengthy and costly chart reviews to extract for retrospective studies.

Objective: This study aimed to evaluate the performance of 3 deep learning language models in determining breast and colorectal cancer progression in EHRs.

Methods: EHRs for individuals diagnosed with stage 4 breast or colorectal cancer between 2004 and 2020 in Manitoba, Canada, were extracted. A chart review was conducted to identify cancer progression in each EHR. Data were analyzed with pretrained deep learning language models (Bio+ClinicalBERT, Clinical-BigBird, and Clinical-Longformer). Sensitivity, positive predictive value, area under the curve, and scaled Brier scores were used to evaluate performance. Influential tokens were identified by removing and adding tokens to EHRs and examining changes in predicted probabilities.
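The token-influence probe amounts to a leave-one-token-out loop: drop a token, rescore the record, and record the change in predicted probability. A sketch in which `predict_proba` stands in for the trained Clinical-Longformer/BigBird classifier, which is not reproduced here:

```python
# Leave-one-token-out influence: a large positive delta means removing the
# token lowered the predicted progression probability, i.e., it was influential.
from typing import Callable

def token_influence(text: str,
                    predict_proba: Callable[[str], float]) -> dict[tuple[int, str], float]:
    base = predict_proba(text)
    tokens = text.split()
    influence = {}
    for i, tok in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])  # text without token i
        influence[(i, tok)] = base - predict_proba(ablated)
    return influence

# For a well-trained model, token_influence("imaging shows disease progression
# in the liver", model_fn) would be expected to rank "progression" highest.
```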

Results: Clinical-BigBird and Clinical-Longformer models for breast and colorectal cancer cohorts demonstrated higher accuracy than the Bio+ClinicalBERT models (scaled Brier scores for breast cancer models: 0.70-0.79 vs 0.49-0.71; scaled Brier scores for colorectal cancer models: 0.61-0.65 vs 0.49-0.61). The same models also demonstrated higher sensitivity (breast cancer models: 86.6%-94.3% vs 76.6%-87.1%; colorectal cancer models: 73.1%-78.9% vs 62.8%-77.0%) and positive predictive value (breast cancer models: 77.9%-92.3% vs 80.6%-85.5%; colorectal cancer models: 81.6%-86.3% vs 72.9%-82.9%) compared to Bio+ClinicalBERT models. All models could remove more than 84% of charts from the chart review process. The most influential token was the word "progression", which was influenced by the presence of other tokens and its position within an EHR.
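The abstract reports "scaled Brier scores" without defining them; assuming the common convention, the Brier score is rescaled against a noninformative model that always predicts the outcome prevalence, so 1 is perfect and 0 is no better than predicting prevalence:

```latex
% Scaled Brier score under the usual convention (an assumption here):
\[
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)^2, \qquad
\mathrm{BS}_{\mathrm{scaled}} = 1 - \frac{\mathrm{BS}}{\bar{p}\,(1-\bar{p})}
\]
```

Here p̂_i is the predicted progression probability for record i, y_i the observed label, and p̄ the outcome prevalence.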

Conclusions: The deep learning language models could help identify breast and colorectal cancer progression in EHRs and remove most charts from the chart review process. A limited number of tokens may influence model predictions. Improvements in model performance could be obtained by increasing the training dataset size and analyzing EHRs at the sentence level rather than at the EHR level.

{"title":"Deep Learning Models to Screen Electronic Health Records for Breast and Colorectal Cancer Progression: Performance Evaluation Study.","authors":"Pascal Lambert, Rayyan Khan, Marshall Pitz, Harminder Singh, Helen Chen, Kathleen Decker","doi":"10.2196/63767","DOIUrl":"10.2196/63767","url":null,"abstract":"<p><strong>Background: </strong>Cancer progression is an important outcome in cancer research. However, it is frequently documented only in electronic health records (EHRs) as unstructured text, which requires lengthy and costly chart reviews to extract for retrospective studies.</p><p><strong>Objective: </strong>This study aimed to evaluate the performance of 3 deep learning language models in determining breast and colorectal cancer progression in EHRs.</p><p><strong>Methods: </strong>EHRs for individuals diagnosed with stage 4 breast or colorectal cancer between 2004 and 2020 in Manitoba, Canada, were extracted. A chart review was conducted to identify cancer progression in each EHR. Data were analyzed with pretrained deep learning language models (Bio+ClinicalBERT, Clinical-BigBird, and Clinical-Longformer). Sensitivity, positive predictive value, area under the curve, and scaled Brier scores were used to evaluate performance. Influential tokens were identified by removing and adding tokens to EHRs and examining changes in predicted probabilities.</p><p><strong>Results: </strong>Clinical-BigBird and Clinical-Longformer models for breast and colorectal cancer cohorts demonstrated higher accuracy than the Bio+ClinicalBERT models (scaled Brier scores for breast cancer models: 0.70-0.79 vs 0.49-0.71; scaled Brier scores for colorectal cancer models: 0.61-0.65 vs 0.49-0.61). The same models also demonstrated higher sensitivity (breast cancer models: 86.6%-94.3% vs 76.6%-87.1%; colorectal cancer models: 73.1%-78.9% vs 62.8%-77.0%) and positive predictive value (breast cancer models: 77.9%-92.3% vs 80.6%-85.5%; colorectal cancer models: 81.6%-86.3% vs 72.9%-82.9%) compared to Bio+ClinicalBERT models. All models could remove more than 84% of charts from the chart review process. The most influential token was the word progression, which was influenced by the presence of other tokens and its position within an EHR.</p><p><strong>Conclusions: </strong>The deep learning language models could help identify breast and colorectal cancer progression in EHRs and remove most charts from the chart review process. A limited number of tokens may influence model predictions. Improvements in model performance could be obtained by increasing the training dataset size and analyzing EHRs at the sentence level rather than at the EHR level.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e63767"},"PeriodicalIF":2.0,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12559821/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145287896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Robust Cancer Crowdfunding Predictions: Leveraging Large Language Models and Machine Learning for Success Analysis.
IF 2.0 | Pub Date: 2025-10-13 | DOI: 10.2196/73448
Runa Bhaumik, Abhishikta Roy, Vineet Srivastava, Lokesh Boggavarapu, Ranganathan Chandrasekaran, Edward K Mensah, John Galvin

Background: Recent advances in large language models (LLMs), such as GPT-4o, offer a transformative opportunity to extract nuanced linguistic, emotional, and social features from campaign texts at scale. These models enable a deeper understanding of the factors influencing campaign success, far beyond what structured data alone can reveal. Given these advancements, there is a pressing need for an integrated modeling framework that leverages both LLM-derived features and machine learning algorithms to more accurately predict and explain success in medical crowdfunding.

Objective: This study addresses that gap by leveraging cutting-edge machine learning techniques alongside state-of-the-art large language models such as GPT-4o to automatically generate and extract nuanced linguistic, social, and clinical features from campaign narratives. By combining these features with ensemble learning approaches, the proposed methodology offers a novel and more comprehensive strategy for understanding and predicting crowdfunding success in the medical domain.

Methods: We used GPT-4o to extract linguistic and social determinants of health (SDOH) features from cancer crowdfunding campaign narratives. A Random Forest model with permutation importance was applied to rank features based on their contribution to predicting campaign success. Four machine learning algorithms (Random Forest, Gradient Boosting, Logistic Regression, and Elastic Net) were evaluated using stratified 10-fold cross-validation, with performance measured by accuracy, sensitivity, and specificity.
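A minimal sketch of the ranking and validation steps named here, with synthetic data standing in for the GPT-4o-derived linguistic/SDOH features, which are not reproduced:

```python
# Random Forest permutation importance to rank features, then stratified
# 10-fold cross-validation of a classifier with accuracy, sensitivity, and
# specificity. Data are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Rank features by how much shuffling each one degrades model performance.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(-imp.importances_mean)  # most influential first

scoring = {
    "accuracy": "accuracy",
    "sensitivity": "recall",
    "specificity": make_scorer(recall_score, pos_label=0),  # recall of negatives
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(GradientBoostingClassifier(), X, y, cv=cv, scoring=scoring)
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```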

Results: Gradient Boosting consistently outperformed the other algorithms in terms of sensitivity (around 0.786 to 0.798), indicating its superior ability to identify successful crowdfunding campaigns from linguistic and social determinants of health features. Permutation importance scores revealed that severe medical conditions, income loss, chemotherapy treatment, clear and effective communication, cognitive understanding, family involvement, and empathetic social behaviors play an important role in the success of campaigns.

Conclusions: This study demonstrates that large language models like GPT-4o can effectively extract nuanced linguistic and social features from crowdfunding narratives, offering deeper insights than traditional methods. These features, when combined with machine learning, significantly improve the identification of key predictors of campaign success, such as medical severity, financial hardship, and empathetic communication. Our findings underscore the potential of LLMs to enhance predictive modeling in health-related crowdfunding and support more targeted policy and communication strategies to reduce financial vulnerability among cancer patients.


{"title":"Robust Cancer Crowdfunding Predictions: Leveraging Large Language Models and Machine Learning for Success Analysis.","authors":"Runa Bhaumik, Abhishikta Roy, Vineet Srivastava, Lokesh Boggavarapu, Ranganathan Chandrasekaran, Edward K Mensah, John Galvin","doi":"10.2196/73448","DOIUrl":"10.2196/73448","url":null,"abstract":"<p><strong>Background: </strong>Recent advances in large language models (LLMs), such as GPT-4o, offer a transformative opportunity to extract nuanced linguistic, emotional, and social features from campaign texts at scale. These models enable a deeper understanding of the factors influencing campaign success-far beyond what structured data alone can reveal. Given these advancements, there is a pressing need for an integrated modeling framework that leverages both LLM-derived features and machine learning algorithms to more accurately predict and explain success in medical crowdfunding.</p><p><strong>Objective: </strong>This study addresses that gap by leveraging cutting-edge machine learning techniques alongside state-of-the-art large language models such as GPT-4o to automatically generate and extract nuanced linguistic, social, and clinical features from campaign narratives. By combining these features with ensemble learning approaches, the proposed methodology offers a novel and more comprehensive strategy for understanding and predicting crowdfunding success in the medical domain.</p><p><strong>Methods: </strong>We used GPT-4o to extract linguistic and social determinants of health (SDOH) features from cancer crowdfunding campaign narratives. A Random Forest model with permutation importance was applied to rank features based on their contribution to predicting campaign success. Four machine learning algorithms-Random Forest, Gradient Boosting, Logistic Regression, and Elastic Net-were evaluated using stratified 10-fold cross-validation, with performance measured by accuracy, sensitivity, and specificity.</p><p><strong>Results: </strong>Gradient Boosting consistently outperforms the other algorithms in terms of sensitivity (consistently around 0.786 to 0.798), indicating its superior ability to identify successful crowdfunding campaigns using linguistic and social determinants of health features. The permutation importance score reveals that for severe medical conditions, income loss, chemotherapy treatment, clear and effective communication, cognitive understanding, family involvement, empathy and social behaviors play an important role in the success of campaigns.</p><p><strong>Conclusions: </strong>This study demonstrates that large language models like GPT-4o can effectively extract nuanced linguistic and social features from crowdfunding narratives, offering deeper insights than traditional methods. These features, when combined with machine learning, significantly improve the identification of key predictors of campaign success, such as medical severity, financial hardship, and empathetic communication. 
Our findings underscore the potential of LLMs to enhance predictive modeling in health-related crowdfunding and support more targeted policy and communication strategies to reduce financial vulnerability among cancer patients.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145287861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-World Evidence Synthesis of Digital Scribes Using Ambient Listening and Generative Artificial Intelligence for Clinician Documentation Workflows: Rapid Review.
IF 2.0 | Pub Date: 2025-10-10 | DOI: 10.2196/76743
Naga Sasidhar Kanaparthy, Yenny Villuendas-Rey, Tolulope Bakare, Zihan Diao, Mark Iscoe, Andrew Loza, Donald Wright, Conrad Safranek, Isaac V Faustino, Alexandria Brackett, Edward R Melnick, R Andrew Taylor

Background: As physicians spend up to twice as much time on electronic health record tasks as on direct patient care, digital scribes have emerged as a promising solution to restore patient-clinician communication and reduce documentation burden, making it essential to study their real-world impact on clinical workflows, efficiency, and satisfaction.

Objective: This study aimed to synthesize evidence on clinician efficiency, user satisfaction, quality, and practical barriers associated with the use of digital scribes using ambient listening and generative artificial intelligence (AI) in real-world clinical settings.

Methods: A rapid review was conducted to evaluate the real-world evidence of digital scribes using ambient listening and generative AI in clinical practice from 2014 to 2024. Data were collected from Ovid MEDLINE, Embase, Web of Science-Core Collection, Cochrane CENTRAL and Reviews, and PubMed Central. Predefined eligibility criteria focused on studies addressing clinical implementation, excluding those centered solely on technical development or model validation. The findings of each study were synthesized and analyzed through the QUEST human evaluation framework for quality and safety and the Systems Engineering Initiative for Patient Safety (SEIPS) 3.0 model to assess integration into clinicians' workflows and experience.

Results: Of the 1450 studies identified, 6 met the inclusion criteria. These studies included an observational study, a case report, a peer-matched cohort study, and survey-based assessments conducted across academic health systems, community settings, and outpatient practices. The major themes were as follows: (1) digital scribes decreased self-reported documentation times, with an associated increase in note length; (2) physician burnout measured using standardized scales was unaffected, but physician engagement improved; (3) physician productivity, assessed via billing metrics, was unchanged; and (4) the studies fell short when compared to standardized frameworks.

Conclusions: Digital scribes show promise in reducing documentation burden and enhancing clinician satisfaction, thereby supporting workflow efficiency. However, the currently available evidence is sparse. Future real-world, multifaceted studies are needed before AI scribes can be recommended unequivocally.

{"title":"Real-World Evidence Synthesis of Digital Scribes Using Ambient Listening and Generative Artificial Intelligence for Clinician Documentation Workflows: Rapid Review.","authors":"Naga Sasidhar Kanaparthy, Yenny Villuendas-Rey, Tolulope Bakare, Zihan Diao, Mark Iscoe, Andrew Loza, Donald Wright, Conrad Safranek, Isaac V Faustino, Alexandria Brackett, Edward R Melnick, R Andrew Taylor","doi":"10.2196/76743","DOIUrl":"10.2196/76743","url":null,"abstract":"<p><strong>Background: </strong>As physicians spend up to twice as much time on electronic health record tasks as on direct patient care, digital scribes have emerged as a promising solution to restore patient-clinician communication and reduce documentation burden-making it essential to study their real-world impact on clinical workflows, efficiency, and satisfaction.</p><p><strong>Objective: </strong>This study aimed to synthesize evidence on clinician efficiency, user satisfaction, quality, and practical barriers associated with the use of digital scribes using ambient listening and generative artificial intelligence (AI) in real-world clinical settings.</p><p><strong>Methods: </strong>A rapid review was conducted to evaluate the real-world evidence of digital scribes using ambient listening and generative AI in clinical practice from 2014 to 2024. Data were collected from Ovid MEDLINE, Embase, Web of Science-Core Collection, Cochrane CENTRAL and Reviews, and PubMed Central. Predefined eligibility criteria focused on studies addressing clinical implementation, excluding those centered solely on technical development or model validation. The findings of each study were synthesized and analyzed through the QUEST human evaluation framework for quality and safety and the Systems Engineering Initiative for Patient Safety (SEIPS) 3.0 model to assess integration into clinicians' workflows and experience.</p><p><strong>Results: </strong>Of the 1450 studies identified, 6 met the inclusion criteria. These studies included an observational study, a case report, a peer-matched cohort study, and survey-based assessments conducted across academic health systems, community settings, and outpatient practices. The major themes noted were as follows: (1) they decreased self-reported documentation times, with associated increased length of notes; (2) physician burnout measured using standardized scales was unaffected, but physician engagement improved; (3) physician productivity, assessed via billing metrics, was unchanged; and (4) the studies fell short when compared to standardized frameworks.</p><p><strong>Conclusions: </strong>Digital scribes show promise in reducing documentation burden and enhancing clinician satisfaction, thereby supporting workflow efficiency. However, the currently available evidence is sparse. Future real-world, multifaceted studies are needed before AI scribes can be recommended unequivocally.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76743"},"PeriodicalIF":2.0,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12513689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145276742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Use of Automated Machine Learning to Detect Undiagnosed Diabetes in US Adults: Development and Validation Study.
IF 2.0 | Pub Date: 2025-10-08 | DOI: 10.2196/68260
Jianxiu Liu, Fred Ssewamala, Ruopeng An, Mengmeng Ji

Background: Early diagnosis of diabetes is essential for early interventions to slow the progression of dysglycemia and its comorbidities. However, among individuals with diabetes, about 23% were unaware of their condition.

Objective: This study aims to investigate the potential use of automated machine learning (AutoML) models and self-reported data in detecting undiagnosed diabetes among US adults.

Methods: Individual-level data, including biochemical tests for diabetes, demographic characteristics, family history of diabetes, anthropometric measures, dietary intakes, health behaviors, and chronic conditions, were retrieved from the National Health and Nutrition Examination Survey, 1999-2020. Undiagnosed diabetes was defined as having no prior self-reported diagnosis but meeting diagnostic criteria for elevated hemoglobin A1c, fasting plasma glucose, or 2-hour plasma glucose. The H2O AutoML framework, which allows for automated hyperparameter tuning, model selection, and ensemble learning, was used to automate the machine learning workflow. For comparative analysis, 4 traditional machine learning models (logistic regression, support vector machines, random forest, and extreme gradient boosting) were implemented. Model performance was evaluated using the area under the receiver operating characteristic curve.
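The H2O AutoML workflow named in the Methods is compact in code. A minimal sketch in which the file name, target column, and run settings are illustrative assumptions, not the study's configuration:

```python
# H2O AutoML: automated model selection, tuning, and stacking for a binary
# undiagnosed-diabetes target; the data file and column names are hypothetical.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("nhanes_1999_2020.csv")            # hypothetical extract
df["undiagnosed_dm"] = df["undiagnosed_dm"].asfactor()  # binary target
train, test = df.split_frame(ratios=[0.8], seed=42)

aml = H2OAutoML(max_models=20, seed=42, sort_metric="AUC")
aml.train(y="undiagnosed_dm",
          x=[c for c in train.columns if c != "undiagnosed_dm"],
          training_frame=train)

print(aml.leaderboard.head())
print(aml.leader.model_performance(test).auc())
```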

Results: The study included 11,815 participants aged 20 years and older, comprising 2256 patients with undiagnosed diabetes and 9559 without diabetes. The average age was 59.76 (SD 15.0) years for participants with undiagnosed diabetes and 46.78 (SD 17.2) years for those without diabetes. The AutoML model demonstrated superior performance compared with the 4 traditional machine learning models. The trained AutoML model achieved an area under the receiver operating characteristic curve of 0.909 (95% CI 0.897-0.921) in the test set. The model demonstrated a sensitivity of 70.26%, specificity of 90.46%, positive predictive value of 64.10%, and negative predictive value of 92.61% for identifying undiagnosed diabetes from nondiabetes.

Conclusions: To our knowledge, this study is the first to use AutoML to detect undiagnosed diabetes in US adults. The model's strong performance and applicability to the broader US population make it a promising tool for large-scale diabetes screening efforts.

{"title":"Use of Automated Machine Learning to Detect Undiagnosed Diabetes in US Adults: Development and Validation Study.","authors":"Jianxiu Liu, Fred Ssewamala, Ruopeng An, Mengmeng Ji","doi":"10.2196/68260","DOIUrl":"10.2196/68260","url":null,"abstract":"<p><strong>Background: </strong>Early diagnosis of diabetes is essential for early interventions to slow the progression of dysglycemia and its comorbidities. However, among individuals with diabetes, about 23% were unaware of their condition.</p><p><strong>Objective: </strong>This study aims to investigate the potential use of automated machine learning (AutoML) models and self-reported data in detecting undiagnosed diabetes among US adults.</p><p><strong>Methods: </strong>Individual-level data, including biochemical tests for diabetes, demographic characteristics, family history of diabetes, anthropometric measures, dietary intakes, health behaviors, and chronic conditions, were retrieved from the National Health and Nutrition Examination Survey, 1999-2020. Undiagnosed diabetes was defined as having no prior self-reported diagnosis but meeting diagnostic criteria for elevated hemoglobin A1c, fasting plasma glucose, or 2-hour plasma glucose. The H2O AutoML framework, which allows for automated hyperparameter tuning, model selection, and ensemble learning, was used to automate the machine learning workflow. For comparative analysis, 4 traditional machine learning models-logistic regression, support vector machines, random forest, and extreme gradient boosting-were implemented. Model performance was evaluated using the area under the receiver operating characteristic curve.</p><p><strong>Results: </strong>The study included 11,815 participants aged 20 years and older, comprising 2256 patients with undiagnosed diabetes and 9559 without diabetes. The average age was 59.76 (SD 15.0) years for participants with undiagnosed diabetes and 46.78 (SD 17.2) years for those without diabetes. The AutoML model demonstrated superior performance compared with the 4 traditional machine learning models. The trained AutoML model achieved an area under the receiver operating characteristic curve of 0.909 (95% CI 0.897-0.921) in the test set. The model demonstrated a sensitivity of 70.26%, specificity of 90.46%, positive predictive value of 64.10%, and negative predictive value of 92.61% for identifying undiagnosed diabetes from nondiabetes.</p><p><strong>Conclusions: </strong>To our knowledge, this study is the first to utilize the AutoML model for detecting undiagnosed diabetes in US adults. The model's strong performance and applicability to the broader US population make it a promising tool for large-scale diabetes screening efforts.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68260"},"PeriodicalIF":2.0,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12532270/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145304956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reinforcement Learning to Prevent Acute Care Events Among Medicaid Populations: Mixed Methods Study. 强化学习预防医疗补助人群中的急性护理事件:混合方法研究。
IF 2 Pub Date : 2025-10-08 DOI: 10.2196/74264
Sanjay Basu, Bhairavi Muralidharan, Parth Sheth, Dan Wanek, John Morgan, Sadiq Patel

Background: Multidisciplinary care management teams must rapidly prioritize interventions for patients with complex medical and social needs. Current approaches rely on individual training, judgment, and experience, missing opportunities to learn from longitudinal trajectories and prevent adverse outcomes through recommender systems.

Objective: This study aims to evaluate whether a reinforcement learning approach could outperform standard care management practices in recommending optimal interventions for patients with complex needs.

Methods: Using data from 3175 Medicaid beneficiaries in care management programs across 2 states from 2023 to 2024, we compared alternative approaches for recommending "next best step" interventions: the standard experience-based approach (status quo) and a state-action-reward-state-action (SARSA) reinforcement learning model. We evaluated performance using clinical impact metrics, conducted counterfactual causal inference analyses to estimate reductions in acute care events, assessed fairness across demographic subgroups, and performed qualitative chart reviews where the models differed.

Results: In counterfactual analyses, SARSA-guided care management reduced acute care events by 12 percentage points (95% CI 2.2-21.8 percentage points, a 20.7% relative reduction; P=.02) compared to the status quo approach, with a number needed to treat of 8.3 (95% CI 4.6-45.2) to prevent 1 acute event. The approach showed improved fairness across demographic groups, including gender (3.8% vs 5.3% disparity in acute event rates, reduction 1.5%, 95% CI 0.3%-2.7%) and race and ethnicity (5.6% vs 8.9% disparity, reduction 3.3%, 95% CI 1.1%-5.5%). In qualitative reviews, the SARSA model detected and recommended interventions for specific medical-social interactions, such as respiratory issues associated with poor housing quality or food insecurity in individuals with diabetes.
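As a quick arithmetic check, the number needed to treat follows directly from the absolute risk reduction: NNT = 1/ARR = 1/0.12 ≈ 8.3, which matches the reported value.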

Conclusions: SARSA-guided care management shows potential to reduce acute care use compared to standard practice. The approach demonstrates how reinforcement learning can improve complex decision-making in situations where patients face concurrent clinical and social factors while maintaining safety and fairness across demographic subgroups.
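For readers unfamiliar with the method, SARSA is on-policy temporal-difference learning with the update Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]. Below is a minimal tabular sketch in Python; the integer states, random transitions, and penalty reward are illustrative stand-ins, not the paper's care-management formulation.

import numpy as np

# Minimal tabular SARSA. The environment is a random-transition stand-in;
# in the paper's setting, states would encode patient trajectories and
# actions the candidate "next best step" interventions.
n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def epsilon_greedy(s):
    # Explore with probability epsilon, otherwise act greedily on Q.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for episode in range(1000):
    s = 0
    a = epsilon_greedy(s)
    for step in range(50):
        s_next = int(rng.integers(n_states))   # placeholder transition
        r = -1.0 if s_next == 0 else 0.0       # placeholder acute-event penalty
        a_next = epsilon_greedy(s_next)
        # On-policy update: bootstraps from the action actually taken next,
        # which is what distinguishes SARSA from Q-learning.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next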

Citations: 0
Assessing the Capability of Large Language Models for Navigation of the Australian Health Care System: Comparative Study. 评估澳洲医疗保健系统导航的大型语言模型的能力:比较研究。
IF 2 Pub Date : 2025-10-07 DOI: 10.2196/76203
Joshua Simmich, Megan Heather Ross, Trevor Glen Russell

Background: Australians can face significant challenges in navigating the health care system, especially in rural and regional areas. Generative search tools, powered by large language models (LLMs), show promise in improving health information retrieval by generating direct answers. However, concerns remain regarding their accuracy and reliability when compared to traditional search engines in a health care context.

Objective: This study aimed to compare the effectiveness of a generative artificial intelligence (AI) search (ie, Microsoft Copilot) versus a conventional search engine (Google Web Search) for navigating health care information.

Methods: A total of 97 adults in Queensland, Australia, participated in a web-based survey, answering scenario-based health care navigation questions using either Microsoft Copilot or Google Web Search. Accuracy was assessed using binary correct or incorrect ratings, graded correctness (incorrect, partially correct, or correct), and numerical scores (0-2 for service identification and 0-6 for criteria). Participants also completed a Technology Rating Questionnaire (TRQ) to evaluate their experience with their assigned tool.

Results: Participants assigned to Microsoft Copilot outperformed the Google Web Search group on 2 health care navigation tasks (identifying aged care application services and listing mobility allowance eligibility criteria), with no clear evidence of a difference in the remaining 6 tasks. On the TRQ, participants rated Google Web Search higher in willingness to adopt and perceived impact on quality of life, and lower in effort needed to learn. Both tools received similar ratings in perceived value, confidence, help required to use, and concerns about privacy.

Conclusions: Generative AI tools can achieve comparable accuracy to traditional search engines for health care navigation tasks, though this did not translate into an improved user experience. Further evaluation is needed as AI technology improves and users become more familiar with its use.

Citations: 0