Background: Artificial intelligence (AI) has recently experienced a resurgence with the growth of generative AI systems such as ChatGPT and Bard. These systems are trained with billions of parameters and have made AI widely accessible to, and understood by, different user groups. Widespread adoption of AI has created a need to understand how machine learning (ML) models operate in order to build trust in them. Understanding how these models generate their results remains a major challenge that explainable AI seeks to solve. Federated learning (FL) grew out of the need for privacy-preserving AI: ML models are trained in a decentralized manner but still share model parameters with a global model.
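The parameter-sharing idea behind FL can be illustrated with federated averaging, in which clients train locally and only their model parameters are combined into a global model. The sketch below is a minimal, hypothetical illustration in Python; the client parameter vectors and sample counts are invented for the example and do not come from any reviewed study.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg-style aggregation: a weighted average of client parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)                      # shape: (n_clients, n_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Hypothetical round: three clients train locally and share only parameters.
local_params = [np.array([0.2, -1.1, 0.7]),
                np.array([0.3, -0.9, 0.5]),
                np.array([0.1, -1.0, 0.9])]
global_params = fed_avg(local_params, client_sizes=[120, 80, 200])
print(global_params)  # raw training data never leaves the clients; only parameters do
```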
Objective: This study sought to examine the extent of development of the explainable AI field within the FL environment in relation to the main contributions made, the types of FL, the sectors it is applied to, the models used, the methods applied by each study, and the databases from which sources are obtained.
Methods: A systematic search in 8 electronic databases, namely, Web of Science Core Collection, Scopus, PubMed, ACM Digital Library, IEEE Xplore, Mendeley, BASE, and Google Scholar, was undertaken.
Results: A review of 26 studies revealed that research on explainable FL is steadily growing, although it is concentrated in Europe and Asia. The key drivers of FL adoption were data privacy and limited training data. Horizontal FL remains the preferred approach for federated ML, and post hoc techniques were the most commonly used form of explainability.
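Post hoc techniques explain an already-trained model by probing its predictions rather than changing its architecture. A minimal permutation-importance sketch is shown below; the model and validation data are synthetic stand-ins, not taken from any of the reviewed studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_val = rng.normal(size=(200, 4))                       # synthetic validation features
y_val = (X_val[:, 0] + 0.5 * X_val[:, 2] > 0).astype(int)
model = LogisticRegression().fit(X_val, y_val)          # stand-in for a trained global model

baseline = accuracy_score(y_val, model.predict(X_val))
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])        # break the feature-target link
    drop = baseline - accuracy_score(y_val, model.predict(X_perm))
    print(f"feature {j}: importance ~= {drop:.3f}")     # larger drop = more influential feature
```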
Conclusions: There is potential for development of novel approaches and improvement of existing approaches in the explainable FL field, especially for critical areas.
Trial Registration: OSF Registries 10.17605/OSF.IO/Y85WA; https://osf.io/y85wa.
{"title":"Explainable AI Approaches in Federated Learning: Systematic Review.","authors":"Titus Tunduny, Bernard Shibwabo","doi":"10.2196/69985","DOIUrl":"https://doi.org/10.2196/69985","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) has, in the recent past, experienced a rebirth with the growth of generative AI systems such as ChatGPT and Bard. These systems are trained with billions of parameters and have enabled widespread accessibility and understanding of AI among different user groups. Widespread adoption of AI has led to the need for understanding how machine learning (ML) models operate to build trust in them. An understanding of how these models generate their results remains a huge challenge that explainable AI seeks to solve. Federated learning (FL) grew out of the need to have privacy-preserving AI by having ML models that are decentralized but still share model parameters with a global model.</p><p><strong>Objective: </strong>This study sought to examine the extent of development of the explainable AI field within the FL environment in relation to the main contributions made, the types of FL, the sectors it is applied to, the models used, the methods applied by each study, and the databases from which sources are obtained.</p><p><strong>Methods: </strong>A systematic search in 8 electronic databases, namely, Web of Science Core Collection, Scopus, PubMed, ACM Digital Library, IEEE Xplore, Mendeley, BASE, and Google Scholar, was undertaken.</p><p><strong>Results: </strong>A review of 26 studies revealed that research on explainable FL is steadily growing despite being concentrated in Europe and Asia. The key determinants of FL use were data privacy and limited training data. Horizontal FL remains the preferred approach for federated ML, whereas post hoc explainability techniques were preferred.</p><p><strong>Conclusions: </strong>There is potential for development of novel approaches and improvement of existing approaches in the explainable FL field, especially for critical areas.</p><p><strong>Trial registration: </strong>OSF Registries 10.17605/OSF.IO/Y85WA; https://osf.io/y85wa.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e69985"},"PeriodicalIF":2.0,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka
Background: Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.
Objective: This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.
Methods: This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.
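A multivariable logistic regression of this form can be fit with standard statistical tooling; the sketch below is a hypothetical illustration. The data frame, variable names, and values are invented, and statsmodels is assumed as the software rather than being the authors' confirmed choice.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical analysis frame; column names mirror the adjusted covariates described above.
rng = np.random.default_rng(1)
n = 92
df = pd.DataFrame({
    "perceived_human": rng.integers(0, 2, n),        # outcome: chatbot perceived as human (1) or not (0)
    "message_humanness": rng.normal(4, 1, n),
    "message_effectiveness": rng.normal(4, 1, n),
    "attitude_toward_ai": rng.normal(3, 1, n),
    "age": rng.normal(46, 12, n),
    "prior_chatbot_use": rng.integers(0, 2, n),
})

X = sm.add_constant(df.drop(columns="perceived_human"))
fit = sm.Logit(df["perceived_human"], X).fit(disp=0)
odds_ratios = np.exp(fit.params)                      # adjusted odds ratios, e.g., for message humanness
print(pd.concat([odds_ratios, fit.pvalues], axis=1, keys=["OR", "p"]))
```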
Results: Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).
Conclusions: To the best of our knowledge, this is the first study in the health care field to explicitly ask participants whether they perceived their interaction as being with a human or with a chatbot (HeartBot). The findings on the role and importance of message humanness provide new insights for designing chatbots. However, the current evidence remains preliminary. Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.
{"title":"Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.","authors":"Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka","doi":"10.2196/67717","DOIUrl":"https://doi.org/10.2196/67717","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.</p><p><strong>Objective: </strong>This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.</p><p><strong>Methods: </strong>This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.</p><p><strong>Results: </strong>Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).</p><p><strong>Conclusions: </strong>To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. 
Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e67717"},"PeriodicalIF":2.0,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham
Background: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.
Objective: To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120-question set, assessing their accuracy and consistency across question types and medical subjects.
Methods: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
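For 2x2 comparisons of correct versus incorrect answers, the chi-square and Fisher exact tests with a Bonferroni adjustment can be run as in the sketch below; the counts and the number of pairwise comparisons are illustrative placeholders, not the study's data.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: correct vs incorrect answers for two models on case-based items.
#           correct  incorrect
table = [[61,  7],    # model A
         [47, 21]]    # model B

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

# Bonferroni adjustment for multiple pairwise comparisons (e.g., all pairs of 5 models).
n_pairwise_comparisons = 10
p_bonferroni = min(1.0, p_fisher * n_pairwise_comparisons)
print(chi2, p_chi2, p_fisher, p_bonferroni)
```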
Results: Grok achieved the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower score was due to its inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.
Conclusions: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.
{"title":"Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.","authors":"Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham","doi":"10.2196/76928","DOIUrl":"https://doi.org/10.2196/76928","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.</p><p><strong>Objective: </strong>To evaluate and compare the performance of five publicly available AI models: Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek, on the USMLE Step 1 Free 120-question set, checking their accuracy and consistency across question types and medical subjects.</p><p><strong>Methods: </strong>This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.</p><p><strong>Results: </strong>Grok got the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower grade was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.</p><p><strong>Conclusions: </strong>AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
Background: Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.
Objective: This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).
Methods: We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the American Joint Committee on Cancer (AJCC) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multiphase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks. The final hybrid model was evaluated on a completely held-out internal test set (black box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.
Results: The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% confidence interval [CI] 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets and fewer Category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable for clinical workflows.
Conclusions: The proposed hybrid framework…
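The LoRA-based SFT step described above can be set up with the Hugging Face peft library, as sketched below. The base checkpoint path, adapter rank, and target module names are placeholders and illustrative choices, not the authors' exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "path/or/hub-id-of-base-llm"   # placeholder; not the authors' exact checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, trust_remote_code=True)

# Parameter-efficient fine-tuning: only low-rank adapter matrices are trained.
lora_cfg = LoraConfig(
    r=16,                                   # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections; names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of total parameters

# A chain-of-thought staging prompt paired with the annotated report text would then be
# fed to a standard supervised trainer (e.g., transformers.Trainer) for the T and N tasks.
```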
{"title":"Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.","authors":"Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li","doi":"10.2196/77988","DOIUrl":"https://doi.org/10.2196/77988","url":null,"abstract":"<p><strong>Background: </strong>Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.</p><p><strong>Objective: </strong>This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).</p><p><strong>Methods: </strong>We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks,. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.</p><p><strong>Results: </strong>The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% Confidence Interval (CI): 0.850-0.959) for T, 86% (95% CI: 0.779-0.915) for N, 92% (95% CI: 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI: 0.790-0.922), 70% (95% CI: 0.604-0.781), 78% (95% CI: 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. 
Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.</p><p><strong>Conclusions: </strong>The proposed hybrid fra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146120701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.
Objective: This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical and governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.
Methods: We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.
Results: Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) applied fine-tuning or domain adaptation to models such as LLaMA, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and in reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.
Conclusions: LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. Future work should prioritize clinically grounded evaluation frameworks…
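The traffic-light appraisal and descriptive aggregation described above amount to tabulating domain-level ratings per study. The sketch below uses entirely hypothetical ratings and an illustrative overall-judgment rule, not the review's actual appraisal data.

```python
import pandas as pd

# Hypothetical traffic-light appraisal table: one row per study, one column per domain.
ratings = pd.DataFrame(
    {
        "design":         ["green", "green", "yellow", "red"],
        "dataset":        ["green", "green", "green",  "yellow"],
        "metrics":        ["green", "yellow", "green", "green"],
        "external_valid": ["red",   "red",    "yellow", "red"],
        "ethics":         ["yellow", "red",   "yellow", "red"],
    },
    index=["study_1", "study_2", "study_3", "study_4"],
)

# Descriptive aggregation: share of studies rated 'green' (low concern) per domain.
print((ratings == "green").mean().mul(100).round(1))

# One simple overall judgment: a study is low risk only if no domain is rated red.
print((ratings != "red").all(axis=1))
```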
{"title":"Large Language Model-based Chatbots and Agentic AI for Mental Health Counseling: A Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards.","authors":"Ha Na Cho, Kai Zheng, Jiayuan Wang, Di Hu","doi":"10.2196/80348","DOIUrl":"https://doi.org/10.2196/80348","url":null,"abstract":"<p><strong>Background: </strong>Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.</p><p><strong>Objective: </strong>This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.</p><p><strong>Methods: </strong>We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.</p><p><strong>Results: </strong>Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuned or domain-adaptation using models such as LlaMa, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.</p><p><strong>Conclusions: </strong>LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. 
Future work should prioritize clinically grounded evaluation frameworks, tra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis
Background: Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.
Objective: We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.
Methods: We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R²), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.
Results: Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R²=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR²=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. Aqueous solubility (r_qp_QPlogS; ≈0.47) and…
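The 70/30 train-test split with MSE, R², and cross-validation described above can be reproduced with scikit-learn. The sketch below uses random placeholder descriptors and activity values rather than the study's Schrödinger-derived data, and a gradient boosting regressor as one representative model rather than the full 17-model comparison.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical descriptor matrix: rows = analogs, columns = molecular descriptors.
rng = np.random.default_rng(42)
X = rng.normal(size=(35, 220))
y = rng.uniform(0.1, 1.0, size=35)          # stand-in activity values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))
print("test R2:", r2_score(y_test, pred))

# 5-fold cross-validation on the training split to check stability and overfitting.
print("CV R2:", cross_val_score(model, X_train, y_train, cv=5, scoring="r2"))
```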
{"title":"Accelerating Discovery of Leukemia Inhibitors Using AI-Driven Quantitative Structure-Activity Relationship: Algorithm Development and Validation.","authors":"Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis","doi":"10.2196/81552","DOIUrl":"10.2196/81552","url":null,"abstract":"<p><strong>Background: </strong>Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.</p><p><strong>Objective: </strong>We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.</p><p><strong>Methods: </strong>We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R<sup>2</sup>), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.</p><p><strong>Results: </strong>Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R<sup>2</sup>=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR<sup>2</sup>=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. 
Aqueous solubility (r_qp_QPlogS; ≈0.47) and","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":"e81552"},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12892034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145703190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk
Background: Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.
Objective: This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.
Methods: A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting (XGBoost) model was developed and tested on data from 154,347 ED visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals to reflect real-world applicability. The primary outcome was the time saved, that is, how much earlier the AI model predicted admission compared with the clinician's recorded admission decision. Secondary outcomes analyzed model performance across subgroups, including patient age, medical specialty, classification category, and time of day.
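An XGBoost admission classifier evaluated with precision and recall, as described above, can be sketched as follows. The feature matrix, labels, and hyperparameters are synthetic placeholders, not the study's model or data.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Hypothetical feature matrix: one row per 10-minute interval of an ED visit.
rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 30))
y = rng.integers(0, 2, size=5000)              # 1 = visit ended in ward admission

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss").fit(X_train, y_train)
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))

# Clinical benefit would then be estimated as the gap between the first interval at which
# the model flags admission and the clinician's recorded admission decision time.
```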
Results: The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for true positive predicted patients. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.
Conclusions: The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.
{"title":"Evaluating an AI Decision Support System for the Emergency Department: Retrospective Study.","authors":"Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk","doi":"10.2196/80448","DOIUrl":"10.2196/80448","url":null,"abstract":"<p><strong>Background: </strong>Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.</p><p><strong>Objective: </strong>This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.</p><p><strong>Methods: </strong>A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting AI model was developed and tested on these data of 154,347 visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome measured was the reduction in the decision-making time between the AI model and the admission decision made by the clinician. Secondary outcomes analyzed the performance of the model across various subgroups, including the age of the patient, medical specialty, classification category, and time of day.</p><p><strong>Results: </strong>The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for true positive predicted patients. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.</p><p><strong>Conclusions: </strong>The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80448"},"PeriodicalIF":2.0,"publicationDate":"2026-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12887564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin
Background: Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.
Objective: The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.
Methods: A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to those of the rephrased LLM versions. A paired-samples Student t test was used to test whether readability scores were reduced, ideally to the eighth grade level or below.
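Both readability scores are simple formulas over word, sentence, and syllable counts. The sketch below implements their commonly cited forms (Amstad's German adaptation of Flesch reading ease and the 4th Wiener Sachtextformel) together with the paired t test; the counts and scores are invented, and the exact formula variants used by the authors are an assumption.

```python
from scipy.stats import ttest_rel

def flesch_reading_ease_de(n_words, n_sentences, n_syllables):
    """Amstad's German Flesch reading ease (higher = easier), in its commonly cited form."""
    asl = n_words / n_sentences            # average sentence length in words
    asw = n_syllables / n_words            # average syllables per word
    return 180 - asl - 58.5 * asw

def wstf4(n_words, n_sentences, n_polysyllabic):
    """4th Wiener Sachtextformel (approximate grade level; lower = easier),
    in its commonly cited form."""
    sl = n_words / n_sentences                         # mean sentence length
    ms = 100 * n_polysyllabic / n_words                # % of words with >= 3 syllables
    return 0.2656 * sl + 0.2744 * ms - 1.693

# Hypothetical paired scores: original texts vs LLM-rephrased versions.
wstf_original  = [11.2, 12.0, 10.8, 11.5]
wstf_rephrased = [9.9, 10.4, 9.1, 10.0]
print(ttest_rel(wstf_original, wstf_rephrased))        # paired t test on readability change
```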
Results: Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). The LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot showed statistically significant improvements in readability. However, the t tests did not show a statistically significant reduction of scores to the eighth grade level or below.
Conclusions: LLMs can improve the readability of German-language PEMs. This moderate improvement can support patients reading PEMs online and make complex medical text accessible to a broader audience. This is the first study to evaluate this for German online medical texts.
{"title":"Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study.","authors":"Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin","doi":"10.2196/77149","DOIUrl":"10.2196/77149","url":null,"abstract":"<p><strong>Background: </strong>Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.</p><p><strong>Objective: </strong>The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.</p><p><strong>Methods: </strong>A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to the rephrased LLM versions. A Student t test for paired samples was used to test the reduction of readability scores, ideally to or lower than the eighth grade level.</p><p><strong>Results: </strong>Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). On average, the LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-40, and Microsoft Copilot showed a statistically significant improvement in readability. However, the t tests yielded no statistically significant results for the reduction of scores lower than the eighth grade level.</p><p><strong>Conclusions: </strong>LLMs can improve the readability of PEMs in German. This moderate improvement can support patients reading PEMs online. LLMs demonstrated their potential to make complex online medical text more accessible to a broader audience by increasing readability. This is the first study to evaluate this for German online medical texts.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e77149"},"PeriodicalIF":2.0,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829587/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompa Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib
Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHR), but this process is labor-intensive and prone to inter-rater variability. Large language models (LLMs) have demonstrated potential in automating text classification.
Objective: We aim to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.
Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) hospital discharge and (2) approximately 90 days post-discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS 0-2) versus non-independence (mRS 3-6). Four-fold cross-validation was conducted, using accuracy and Cohen's kappa as model performance metrics.
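Accuracy and (weighted) Cohen's kappa for the multiclass and binary mRS labels can be computed with scikit-learn as sketched below. The label vectors are invented, and quadratic weighting is shown as one common choice for an ordinal scale rather than the authors' confirmed setting.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical gold-standard vs model-predicted mRS scores (0-6).
y_true = [0, 1, 2, 3, 4, 5, 6, 3, 4, 2]
y_pred = [0, 1, 3, 3, 4, 5, 6, 2, 4, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
# Quadratic weighting credits near-misses on the ordinal scale, which is why a weighted
# kappa can be high even when exact-class accuracy is only moderate.
print("weighted kappa:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))

# Binary collapse: functional independence (mRS 0-2) vs non-independence (mRS 3-6).
b_true = [int(s >= 3) for s in y_true]
b_pred = [int(s >= 3) for s in y_pred]
print("binary accuracy:", accuracy_score(b_true, b_pred))
print("binary kappa:", cohen_kappa_score(b_true, b_pred))
```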
Results: A total of 2,290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, considering all seven mRS scores, attained an accuracy of 77% and a weighted Cohen's kappa of 0.92. Class-specific accuracy was highest for mRS 4 (90%) and lowest for mRS 2 (28%). The binary model, considering only functional independence versus non-independence, attained an accuracy of 92% and a Cohen's kappa of 0.84.
Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improved discrimination between intermediate scores is required.
{"title":"Assessment of the Modified Rankin Scale in Electronic Health Records with a Fine-tuned Large Language Model.","authors":"Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompa Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib","doi":"10.2196/82607","DOIUrl":"10.2196/82607","url":null,"abstract":"<p><strong>Background: </strong>The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHR), but this process is labor-intensive and prone to inter-rater variability. Large language models (LLMs) have demonstrated potential in automating text classification.</p><p><strong>Objective: </strong>We aim to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.</p><p><strong>Methods: </strong>We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) hospital discharge and (2) approximately 90 days post-discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS 0-2) versus non-independence (mRS 3-6). Four-fold cross-validation was conducted, using accuracy and Cohen's kappa as model performance metrics.</p><p><strong>Results: </strong>A total of 2,290 EHR passages with corresponding mRS scores were included in model training. The multiclass model-considering all seven scores of the mRS-attained an accuracy of 77% and a weighted Cohen's Kappa of 0.92. Class-specific accuracy was highest for mRS 4 (90%) and lowest for mRS 2 (28%). The binary model-considering only functional independence vs non-independence -attained an accuracy of 92% and Cohen's Kappa of 0.84. Conclusions.</p><p><strong>Conclusions: </strong>Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis, however, improving discrimination between intermediate scores is required.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek
Background: Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.
Objective: This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.
Methods: We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review labeled each encounter with gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the deterioration elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.
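A stacking ensemble over an elastic net logistic regression and a gradient boosted machine, evaluated by AUROC, can be sketched with scikit-learn as below. The features, labels, and hyperparameters are synthetic placeholders, and the LSTM base learner from the study is only indicated in a comment.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical cohort: one row per high-risk encounter, binary label = intervention given.
rng = np.random.default_rng(3)
X = rng.normal(size=(2480, 40))
y = rng.integers(0, 2, size=2480)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

base_learners = [
    ("elastic_net", LogisticRegression(penalty="elasticnet", solver="saga",
                                       l1_ratio=0.5, C=1.0, max_iter=5000)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
    # An LSTM over the vital-sign time series would be a third base learner in the study.
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```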
Results: Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012) while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the deterioration elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.
Conclusions: We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.
{"title":"Treatment Recommendations for Clinical Deterioration on the Wards: Development and Validation of Machine Learning Models.","authors":"Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek","doi":"10.2196/81642","DOIUrl":"10.2196/81642","url":null,"abstract":"<p><strong>Background: </strong>Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.</p><p><strong>Objective: </strong>This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.</p><p><strong>Methods: </strong>We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review labeled each encounter with gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the deterioration elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.</p><p><strong>Results: </strong>Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012) while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the deterioration elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.</p><p><strong>Conclusions: </strong>We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. 
Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81642"},"PeriodicalIF":2.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12810948/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}