
Latest Publications from JMIR AI

Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.
IF 2 | Pub Date: 2026-02-03 | DOI: 10.2196/67717
Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka

Background: Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.

Objective: This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.

Methods: This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.
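As a rough illustration of the analysis described above, the sketch below fits a multivariable logistic regression with statsmodels and reports adjusted odds ratios as exponentiated coefficients. The file and column names (heartbot_postchat_survey.csv, perceived_human, message_humanness, and so on) are hypothetical placeholders, not the study's actual variables.

```python
# Hypothetical sketch: multivariable logistic regression for perceived
# chatbot identity (1 = perceived as human, 0 = artificial agent).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("heartbot_postchat_survey.csv")  # hypothetical file

model = smf.logit(
    "perceived_human ~ age + C(race_ethnicity) + C(education)"
    " + prior_chatbot_use + message_humanness + message_effectiveness"
    " + ai_attitude",
    data=df,
).fit()

# Adjusted odds ratios and their 95% CIs are the exponentiated
# coefficients and confidence bounds from the fitted logit model
ci = model.conf_int()
summary = pd.DataFrame({
    "aOR": np.exp(model.params),
    "CI_lower": np.exp(ci[0]),
    "CI_upper": np.exp(ci[1]),
    "p": model.pvalues,
})
print(summary)
```

Under this encoding, the row for message_humanness would correspond to the adjusted odds ratio reported in the Results below.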

Results: Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).

Conclusions: To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.

Citations: 0
Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.
IF 2 | Pub Date: 2026-01-30 | DOI: 10.2196/76928
Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham

Background: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.

Objective: To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120-question set, assessing their accuracy and consistency across question types and medical subjects.

Methods: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
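For readers who want to reproduce this style of analysis, the sketch below runs pairwise chi-square or Fisher's exact tests with a Bonferroni-adjusted threshold. The per-model correct/incorrect counts are illustrative stand-ins back-calculated from the reported percentages, not the study's raw data.

```python
# Illustrative pairwise comparison of per-model accuracy on 119 questions,
# with (correct, incorrect) counts derived from the reported percentages.
from itertools import combinations
from scipy.stats import chi2_contingency, fisher_exact

results = {
    "Grok": (109, 10), "Copilot": (101, 18), "Gemini": (100, 19),
    "ChatGPT-4": (95, 24), "DeepSeek": (86, 33),
}

pairs = list(combinations(results, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold (0.05 / 10)

for a, b in pairs:
    table = [list(results[a]), list(results[b])]
    # Crude rule of thumb: fall back to Fisher's exact test when any
    # observed cell count is small
    if min(min(row) for row in table) < 5:
        p = fisher_exact(table)[1]
    else:
        p = chi2_contingency(table)[1]
    print(f"{a} vs {b}: p={p:.4f}, significant={p < alpha}")
```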

Results: Grok achieved the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.

Conclusions: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.


Citations: 0
Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.
IF 2 | Pub Date: 2026-01-29 | DOI: 10.2196/77988
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
Background: Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.

Objective: This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).

Methods: We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the American Joint Committee on Cancer (AJCC) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multiphase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.

Results: The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% confidence interval [CI] 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, defined as misclassifications that could significantly influence subsequent clinical decisions: the model committed zero Category I errors in M staging across both test sets and fewer Category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies acceptable for clinical workflows.

Conclusions: The proposed hybrid framework, which integrates structured prompt engineering with SFT applied to the reasoning-heavy T and N staging tasks, establishes the GLM-4-Air model as a highly accurate, clinically reliable, and cost-effective solution for automated TNM staging of NSCLC. This work demonstrates the effectiveness and potential of domain-optimized smaller models relative to off-the-shelf generalist models, promising enhanced diagnostic standardization in resource-aware health care settings.
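A minimal sketch of the parameter-efficient fine-tuning step described in the Methods is shown below, using Hugging Face peft. The base checkpoint name, target module names, and hyperparameters are assumptions for illustration; the paper's exact GLM-4-Air configuration is not given here.

```python
# Hypothetical LoRA configuration for the SFT step; checkpoint, target
# modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "THUDM/glm-4-9b"  # stand-in for the GLM-4-Air backbone
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

config = LoraConfig(
    r=8,                                 # low-rank update dimension
    lora_alpha=32,                       # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # assumed attention projection name
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter matrices train
# Training would then proceed on (report text, staging output) pairs with a
# standard causal language modeling objective, e.g. via transformers.Trainer.
```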
{"title":"Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.","authors":"Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li","doi":"10.2196/77988","DOIUrl":"https://doi.org/10.2196/77988","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks,. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% Confidence Interval (CI): 0.850-0.959) for T, 86% (95% CI: 0.779-0.915) for N, 92% (95% CI: 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI: 0.790-0.922), 70% (95% CI: 0.604-0.781), 78% (95% CI: 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. 
Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;The proposed hybrid fra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146120701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Large Language Model-based Chatbots and Agentic AI for Mental Health Counseling: A Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards.
IF 2 | Pub Date: 2026-01-27 | DOI: 10.2196/80348
Ha Na Cho, Kai Zheng, Jiayuan Wang, Di Hu
Background: Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.

Objective: This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.

Methods: We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.

Results: Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuning or domain adaptation with models such as LLaMA, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.

Conclusions: LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. Future work should prioritize clinically grounded evaluation frameworks, transparent model reporting and prompt configurations, and stronger validation using standardized outcomes to support safe, reliable, and regulation-ready deployment.
{"title":"Large Language Model-based Chatbots and Agentic AI for Mental Health Counseling: A Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards.","authors":"Ha Na Cho, Kai Zheng, Jiayuan Wang, Di Hu","doi":"10.2196/80348","DOIUrl":"https://doi.org/10.2196/80348","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuned or domain-adaptation using models such as LlaMa, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. 
Future work should prioritize clinically grounded evaluation frameworks, tra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Discovery of Leukemia Inhibitors Using AI-Driven Quantitative Structure-Activity Relationship: Algorithm Development and Validation.
IF 2 | Pub Date: 2026-01-27 | DOI: 10.2196/81552
Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis
Background: Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.

Objective: We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.

Methods: We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R²), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.

Results: Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R²=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR²=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. Aqueous solubility (r_qp_QPlogS; ≈0.47) and hydrogen bond acceptor count (r_qp_accptHB; ≈0.44) were also among the top ten features. The importance of these descriptors was consistent across multiple algorithmic models, including random forest, XGBoost, and partial least squares approaches.

Conclusions: Integrating advanced ML with QSAR modeling enables systematic analysis of the structure-activity relationships of thiadiazolidinone analogs in this dataset. While ensemble methods captured complex patterns with high internal validation metrics, external validation on independent compounds and prospective experimental testing are essential before broad therapeutic claims can be made. This work provides a methodological foundation and identifies candidate molecular features for future validation efforts.
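The descriptor-ranking step described in the Results can be sketched as below: fit a tree ensemble on the descriptor table, then rank features by mean absolute SHAP value. File and column names are hypothetical, and a single random forest stands in for the paper's full 17-model comparison.

```python
# Hypothetical sketch: tree-ensemble QSAR model with SHAP-based descriptor
# ranking; file and column names are placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("tdzd_descriptors.csv")  # 220 MAESTRO descriptors + label
y = data.pop("activity")                    # antileukemia activity
X = data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)  # 70/30 split

model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_tr, y_tr)
print("test R^2:", model.score(X_te, y_te))

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)

# Mean absolute SHAP value per descriptor, largest first
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```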
{"title":"Accelerating Discovery of Leukemia Inhibitors Using AI-Driven Quantitative Structure-Activity Relationship: Algorithm Development and Validation.","authors":"Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis","doi":"10.2196/81552","DOIUrl":"10.2196/81552","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R&lt;sup&gt;2&lt;/sup&gt;), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R&lt;sup&gt;2&lt;/sup&gt;=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR&lt;sup&gt;2&lt;/sup&gt;=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. 
Aqueous solubility (r_qp_QPlogS; ≈0.47) and","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":"e81552"},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12892034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145703190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluating an AI Decision Support System for the Emergency Department: Retrospective Study.
IF 2 | Pub Date: 2026-01-26 | DOI: 10.2196/80448
Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk

Background: Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.

Objective: This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.

Methods: A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting (XGBoost) model was developed and tested on data from these 154,347 visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome was the time saved between the AI model's prediction and the clinician's admission decision. Secondary outcomes analyzed the performance of the model across various subgroups, including patient age, medical specialty, classification category, and time of day.
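A simplified sketch of this modeling setup is shown below: an XGBoost classifier trained on interval-level encounter snapshots and scored with precision and recall. The file and column names are hypothetical placeholders.

```python
# Hypothetical sketch of the admission classifier; files and columns are
# placeholders for the 10-minute interval feature snapshots.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score

train = pd.read_csv("ed_intervals_train.csv")
test = pd.read_csv("ed_intervals_test.csv")

features = [c for c in train.columns if c not in ("encounter_id", "admitted")]
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(train[features], train["admitted"])

pred = model.predict(test[features])
print("precision:", precision_score(test["admitted"], pred))
print("recall:", recall_score(test["admitted"], pred))
```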

Results: The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for patients with true-positive predictions. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.

Conclusions: The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.

Citations: 0
Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study.
IF 2 | Pub Date: 2026-01-23 | DOI: 10.2196/77149
Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin

Background: Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.

Objective: The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.

Methods: A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to the rephrased LLM versions. A Student t test for paired samples was used to test the reduction of readability scores, ideally to or lower than the eighth grade level.
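For reference, the sketch below implements the two readability measures named above, using the commonly cited formulas: Amstad's German adaptation of Flesch reading ease and the 4th Wiener Sachtextformel. Treat the exact coefficients as an assumption in case the study used a variant.

```python
# Readability scores from precomputed text statistics. Coefficients follow
# the commonly cited Amstad and 4th Vienna formulas (an assumption).
def flesch_reading_ease_de(words, sentences, syllables):
    """Amstad's German FRE; higher means easier (30-50 is 'difficult')."""
    asl = words / sentences   # average sentence length in words
    asw = syllables / words   # average syllables per word
    return 180 - asl - 58.5 * asw

def wiener_sachtextformel_4(words, sentences, long_words):
    """4th Vienna formula; approximates the German school grade required.
    long_words = number of words with three or more syllables."""
    sl = words / sentences           # mean sentence length
    ms = 100 * long_words / words    # % of words with >= 3 syllables
    return 0.2656 * sl + 0.2744 * ms - 1.693

# Example: 120 words in 6 sentences, 230 syllables, 28 long words
print(flesch_reading_ease_de(120, 6, 230))  # ~47.9 -> "difficult"
print(wiener_sachtextformel_4(120, 6, 28))  # ~10.0 -> ~10th grade

# The paired comparison of original vs rephrased scores could then use
# scipy.stats.ttest_rel(scores_original, scores_rephrased).
```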

Results: Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). On average, the LLMs achieved the following scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot showed a statistically significant improvement in readability. However, the t tests showed no statistically significant reduction of scores to or below the eighth grade level.

Conclusions: LLMs can improve the readability of PEMs in German. This moderate improvement can support patients reading PEMs online. LLMs demonstrated their potential to make complex online medical text more accessible to a broader audience by increasing readability. This is the first study to evaluate this for German online medical texts.

Citations: 0
Assessment of the Modified Rankin Scale in Electronic Health Records with a Fine-tuned Large Language Model.
IF 2 | Pub Date: 2026-01-18 | DOI: 10.2196/82607
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompa Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib

Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHR), but this process is labor-intensive and prone to inter-rater variability. Large language models (LLMs) have demonstrated potential in automating text classification.

Objective: We aim to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.

Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) hospital discharge and (2) approximately 90 days post-discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS 0-2) versus non-independence (mRS 3-6). Four-fold cross-validation was conducted, using accuracy and Cohen's kappa as model performance metrics.
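The evaluation metrics described above can be computed directly with scikit-learn, as sketched below on illustrative labels; the abstract does not state the kappa weighting scheme, so quadratic weighting (a common choice for ordinal scales such as the mRS) is assumed.

```python
# Accuracy and weighted Cohen's kappa on illustrative mRS labels (0-6).
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 5, 6, 3, 4, 2]  # chart-reviewed scores (illustrative)
y_pred = [0, 1, 1, 3, 4, 5, 6, 4, 4, 2]  # model predictions (illustrative)

print("accuracy:", accuracy_score(y_true, y_pred))
# Quadratic weights penalize distant mRS disagreements more than adjacent
# ones, matching the scale's ordinal structure (weighting scheme assumed).
print("weighted kappa:",
      cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```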

Results: A total of 2,290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, considering all seven scores of the mRS, attained an accuracy of 77% and a weighted Cohen's kappa of 0.92. Class-specific accuracy was highest for mRS 4 (90%) and lowest for mRS 2 (28%). The binary model, considering only functional independence versus non-independence, attained an accuracy of 92% and a Cohen's kappa of 0.84.

Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, discrimination between intermediate scores needs improvement.


Citations: 0
Treatment Recommendations for Clinical Deterioration on the Wards: Development and Validation of Machine Learning Models.
IF 2 Pub Date : 2026-01-16 DOI: 10.2196/81642
Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek

Background: Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.

Objective: This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.

Methods: We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review assigned gold-standard lifesaving treatment labels to each encounter. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the elevated deterioration risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.
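To make the shape of this setup concrete, here is a minimal sketch of a stacking ensemble evaluated by AUROC, assuming scikit-learn. The long short-term memory base model and the real EHR features are omitted; the synthetic data, feature count, and hyperparameters are stand-ins, not the authors' actual pipeline.

```python
# Hedged sketch: elastic net logistic regression + gradient boosted machine,
# combined by a stacking ensemble and scored by AUROC on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the chart-reviewed cohort (n=2480 encounters).
X, y = make_classification(n_samples=2480, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

base_models = [
    ("elastic_net", LogisticRegression(penalty="elasticnet", solver="saga",
                                       l1_ratio=0.5, max_iter=5000)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]
# The meta-learner combines out-of-fold probabilities from the base models
# (4-fold, mirroring the paper's cross-validation).
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=4, stack_method="predict_proba")
stack.fit(X_train, y_train)

# Held-out AUROC, a stand-in for the external-validation health system.
auroc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"AUROC: {auroc:.3f}")
```

In practice one such model would be fit per intervention, so each of the 10 treatments gets its own AUROC, as reported in the Results below.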

Results: Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012), while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the elevated deterioration risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.

Conclusions: We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.

{"title":"Treatment Recommendations for Clinical Deterioration on the Wards: Development and Validation of Machine Learning Models.","authors":"Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek","doi":"10.2196/81642","DOIUrl":"10.2196/81642","url":null,"abstract":"<p><strong>Background: </strong>Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.</p><p><strong>Objective: </strong>This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.</p><p><strong>Methods: </strong>We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review labeled each encounter with gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the deterioration elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.</p><p><strong>Results: </strong>Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012) while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the deterioration elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.</p><p><strong>Conclusions: </strong>We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. 
Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81642"},"PeriodicalIF":2.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12810948/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Role of AI in Improving Digital Wellness Among Older Adults: Comparative Bibliometric Analysis.
IF 2 Pub Date : 2026-01-14 DOI: 10.2196/71248
Naveh Eskinazi, Moti Zwilling, Adilson Marques, Riki Tesler

Background: Advances in artificial intelligence (AI) have revolutionized digital wellness by providing innovative solutions for health, social connectivity, and overall well-being. Despite these advancements, the older population often struggles with barriers such as accessibility, digital literacy, and infrastructure limitations, leaving them at risk of digital exclusion. These challenges underscore the critical need for tailored AI-driven interventions to bridge the digital divide and enhance the inclusion of older adults in the digital ecosystem.

Objective: This study presents a comparative bibliometric analysis of research on the role of AI in promoting digital wellness, with a particular emphasis on the older population in comparison to the general population. The analysis addressed five key research topics: (1) the evolution of AI's impact on digital wellness over time for both the older and general population, (2) patterns of collaboration globally, (3) leading institutions' contribution to AI-focused research, (4) prominent journals in the field, and (5) emerging themes and trends in AI-related research.

Methods: Data were collected from the Web of Science between 2016 and 2025, totaling 3429 documents (344 related to older people), which were analyzed using bibliometric tools.
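As a toy illustration of the kind of analysis a bibliometric study performs, the sketch below counts publications per year (growth trends) and keyword frequencies (emerging themes) over a handful of invented records, assuming pandas. The "PY" and "DE" column names mirror Web of Science field tags for publication year and author keywords, but the data and pipeline are assumptions, not the study's actual workflow.

```python
# Illustrative bibliometric counts over hypothetical Web of Science records.
import pandas as pd

records = pd.DataFrame({
    "PY": [2016, 2018, 2019, 2021, 2021, 2023, 2024],  # publication year
    "DE": ["digital health", "machine learning; telemedicine",
           "dementia", "mobile health", "machine learning",
           "risk management", "digital health; dementia"],  # author keywords
})

# Publications per year: the growth curve of the field.
per_year = records.groupby("PY").size()
print(per_year)

# Keyword frequencies: a simple proxy for emerging research themes.
keywords = records["DE"].str.split("; ").explode().str.strip()
print(keywords.value_counts())
```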

Results: Results indicate that AI-related digital wellness research for the general population has experienced exponential growth since 2016, with significant contributions from the United States, the United Kingdom, and China. In contrast, research on older people has seen slower growth, with more localized collaboration networks and a steady increase in citations. Key research topics for the general population include digital health, machine learning, and telemedicine, whereas studies on older people focus on dementia, mobile health, and risk management.

Conclusions: The results of our analysis highlight a growing body of research on AI-driven solutions intended to improve digital wellness among older people, and they identify future research directions that address the specific needs of this population segment.

{"title":"The Role of AI in Improving Digital Wellness Among Older Adults: Comparative Bibliometric Analysis.","authors":"Naveh Eskinazi, Moti Zwilling, Adilson Marques, Riki Tesler","doi":"10.2196/71248","DOIUrl":"10.2196/71248","url":null,"abstract":"<p><strong>Background: </strong>Advances in artificial intelligence (AI) have revolutionized digital wellness by providing innovative solutions for health, social connectivity, and overall well-being. Despite these advancements, the older population often struggles with barriers such as accessibility, digital literacy, and infrastructure limitations, leaving them at risk of digital exclusion. These challenges underscore the critical need for tailored AI-driven interventions to bridge the digital divide and enhance the inclusion of older adults in the digital ecosystem.</p><p><strong>Objective: </strong>This study presents a comparative bibliometric analysis of research on the role of AI in promoting digital wellness, with a particular emphasis on the older population in comparison to the general population. The analysis addressed five key research topics: (1) the evolution of AI's impact on digital wellness over time for both the older and general population, (2) patterns of collaboration globally, (3) leading institutions' contribution to AI-focused research, (4) prominent journals in the field, and (5) emerging themes and trends in AI-related research.</p><p><strong>Methods: </strong>Data were collected from the Web of Science between 2016 and 2025, totaling 3429 documents (344 related to older people), analyzed using bibliometric tools.</p><p><strong>Results: </strong>Results indicate that AI-related digital wellness research for the general population has experienced exponential growth since 2016, with significant contributions from the United States, the United Kingdom, and China. In contrast, research on older people has seen slower growth, with more localized collaboration networks and a steady increase in citations. Key research topics for the general population include digital health, machine learning, and telemedicine, whereas studies on older people focus on dementia, mobile health, and risk management.</p><p><strong>Conclusions: </strong>The results of our analysis highlight an increasing body of research focused on AI-driven solutions intended to improve the digital wellness among older people and identify future research directions to refer to the specific needs of this population segment.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e71248"},"PeriodicalIF":2.0,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12808873/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0