Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka
Background: Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.
Objective: This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.
Methods: This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.
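As an illustration of the modeling step above, the sketch below fits a multivariable logistic regression and converts its coefficients into adjusted odds ratios with 95% CIs. It is a minimal sketch on synthetic data with hypothetical column names (race or ethnicity and education are omitted for brevity), not the HeartBot dataset or the authors' exact model.

```python
# Minimal sketch: multivariable logistic regression with adjusted odds ratios.
# Synthetic data and illustrative variable names only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 92
df = pd.DataFrame({
    "perceived_human": rng.integers(0, 2, n),        # 1 = perceived chatbot as human
    "message_humanness": rng.normal(4.0, 1.0, n),    # e.g., mean item score
    "message_effectiveness": rng.normal(4.0, 1.0, n),
    "ai_attitude": rng.normal(3.5, 1.0, n),
    "age": rng.normal(46, 12, n),
    "prior_chatbot_use": rng.integers(0, 2, n),
})

model = smf.logit(
    "perceived_human ~ message_humanness + message_effectiveness + ai_attitude"
    " + age + prior_chatbot_use",
    data=df,
).fit(disp=False)

# Exponentiate coefficients to obtain adjusted odds ratios with 95% CIs.
or_table = pd.concat([model.params, model.conf_int()], axis=1)
or_table.columns = ["coef", "ci_low", "ci_high"]
print(np.exp(or_table))
```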
Results: Among 92 women (mean age 45.9, SD 11.9 years; range 26-70 years), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).
Conclusions: To the best of our knowledge, this is the first study in the health care field to explicitly ask participants whether they perceived their interaction as being with a human or with a chatbot (HeartBot). The findings on the role and importance of message humanness provide new insights for chatbot design. However, the current evidence remains preliminary. Future research is warranted to examine the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.
{"title":"Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.","authors":"Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka","doi":"10.2196/67717","DOIUrl":"https://doi.org/10.2196/67717","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.</p><p><strong>Objective: </strong>This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.</p><p><strong>Methods: </strong>This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.</p><p><strong>Results: </strong>Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).</p><p><strong>Conclusions: </strong>To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. 
Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e67717"},"PeriodicalIF":2.0,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham
Background: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.
Objective: To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120 question set, assessing their accuracy and consistency across question types and medical subjects.
Methods: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
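The pairwise comparison procedure described above can be illustrated with the sketch below. The correct/incorrect counts are hypothetical; only the testing-and-adjustment pattern is shown, not the study's actual data or decision rules.

```python
# Illustrative sketch: pairwise model comparison with chi-square (or Fisher's exact
# test for small counts) and a Bonferroni-adjusted significance threshold.
from itertools import combinations
from scipy.stats import chi2_contingency, fisher_exact

# (correct, incorrect) counts per model on the same question set -- made-up numbers.
results = {"Grok": (109, 10), "Copilot": (101, 18), "Gemini": (100, 19),
           "ChatGPT-4": (95, 24), "DeepSeek": (86, 33)}

pairs = list(combinations(results, 2))
alpha = 0.05 / len(pairs)  # Bonferroni adjustment across all pairwise tests

for a, b in pairs:
    table = [list(results[a]), list(results[b])]
    # Simplified rule: fall back to Fisher's exact test when any count is small.
    if min(min(row) for row in table) < 5:
        _, p = fisher_exact(table)
        test = "Fisher"
    else:
        _, p, _, _ = chi2_contingency(table)
        test = "chi-square"
    print(f"{a} vs {b}: {test} p={p:.4f} (significant at alpha={alpha:.4f}: {p < alpha})")
```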
Results: Grok achieved the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n=96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (P=.011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.
Conclusions: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.
{"title":"Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.","authors":"Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham","doi":"10.2196/76928","DOIUrl":"https://doi.org/10.2196/76928","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.</p><p><strong>Objective: </strong>To evaluate and compare the performance of five publicly available AI models: Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek, on the USMLE Step 1 Free 120-question set, checking their accuracy and consistency across question types and medical subjects.</p><p><strong>Methods: </strong>This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.</p><p><strong>Results: </strong>Grok got the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower grade was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.</p><p><strong>Conclusions: </strong>AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
<p><strong>Background: </strong>Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.</p><p><strong>Objective: </strong>This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).</p><p><strong>Methods: </strong>We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks,. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.</p><p><strong>Results: </strong>The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% Confidence Interval (CI): 0.850-0.959) for T, 86% (95% CI: 0.779-0.915) for N, 92% (95% CI: 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI: 0.790-0.922), 70% (95% CI: 0.604-0.781), 78% (95% CI: 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.</p><p><strong>Conclusions: </strong>The proposed hybrid fra
{"title":"Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.","authors":"Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li","doi":"10.2196/77988","DOIUrl":"https://doi.org/10.2196/77988","url":null,"abstract":"<p><strong>Background: </strong>Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.</p><p><strong>Objective: </strong>This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).</p><p><strong>Methods: </strong>We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks,. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.</p><p><strong>Results: </strong>The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% Confidence Interval (CI): 0.850-0.959) for T, 86% (95% CI: 0.779-0.915) for N, 92% (95% CI: 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI: 0.790-0.922), 70% (95% CI: 0.604-0.781), 78% (95% CI: 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. 
Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.</p><p><strong>Conclusions: </strong>The proposed hybrid fra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146120701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
<p><strong>Background: </strong>Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.</p><p><strong>Objective: </strong>This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.</p><p><strong>Methods: </strong>We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.</p><p><strong>Results: </strong>Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuned or domain-adaptation using models such as LlaMa, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.</p><p><strong>Conclusions: </strong>LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. Future work should prioritize clinically grounded evaluation frameworks, tra
{"title":"Large Language Model-based Chatbots and Agentic AI for Mental Health Counseling: A Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards.","authors":"Ha Na Cho, Kai Zheng, Jiayuan Wang, Di Hu","doi":"10.2196/80348","DOIUrl":"https://doi.org/10.2196/80348","url":null,"abstract":"<p><strong>Background: </strong>Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.</p><p><strong>Objective: </strong>This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.</p><p><strong>Methods: </strong>We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.</p><p><strong>Results: </strong>Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuned or domain-adaptation using models such as LlaMa, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.</p><p><strong>Conclusions: </strong>LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. 
Future work should prioritize clinically grounded evaluation frameworks, tra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis
<p><strong>Background: </strong>Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.</p><p><strong>Objective: </strong>We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.</p><p><strong>Methods: </strong>We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R<sup>2</sup>), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.</p><p><strong>Results: </strong>Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R<sup>2</sup>=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR<sup>2</sup>=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. Aqueous solubility (r_qp_QPlogS; ≈0.47) and
{"title":"Accelerating Discovery of Leukemia Inhibitors Using AI-Driven Quantitative Structure-Activity Relationship: Algorithm Development and Validation.","authors":"Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis","doi":"10.2196/81552","DOIUrl":"10.2196/81552","url":null,"abstract":"<p><strong>Background: </strong>Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.</p><p><strong>Objective: </strong>We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.</p><p><strong>Methods: </strong>We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R<sup>2</sup>), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.</p><p><strong>Results: </strong>Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R<sup>2</sup>=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR<sup>2</sup>=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. 
Aqueous solubility (r_qp_QPlogS; ≈0.47) and","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":"e81552"},"PeriodicalIF":2.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12892034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145703190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk
Background: Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.
Objective: This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.
Methods: A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting AI model was developed and tested on these data of 154,347 visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome measured was the reduction in the decision-making time between the AI model and the admission decision made by the clinician. Secondary outcomes analyzed the performance of the model across various subgroups, including the age of the patient, medical specialty, classification category, and time of day.
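A minimal sketch of this setup is shown below, assuming synthetic features in place of the 10-minute-interval snapshots: a gradient-boosted classifier evaluated with precision and recall, mirroring the reported metrics. It is illustrative only and not the St. Antonius model or data.

```python
# Sketch of an admission-prediction classifier on synthetic stand-in features.
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 20))  # e.g., vitals, labs, triage category (placeholders)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=5000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```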
Results: The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for patients with true-positive predictions. Subgroup analysis revealed that older patients and certain specialties, such as pulmonology, benefited the most from the AI model, with time savings of up to 90 minutes per patient.
Conclusions: The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.
{"title":"Evaluating an AI Decision Support System for the Emergency Department: Retrospective Study.","authors":"Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk","doi":"10.2196/80448","DOIUrl":"10.2196/80448","url":null,"abstract":"<p><strong>Background: </strong>Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.</p><p><strong>Objective: </strong>This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.</p><p><strong>Methods: </strong>A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting AI model was developed and tested on these data of 154,347 visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome measured was the reduction in the decision-making time between the AI model and the admission decision made by the clinician. Secondary outcomes analyzed the performance of the model across various subgroups, including the age of the patient, medical specialty, classification category, and time of day.</p><p><strong>Results: </strong>The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for true positive predicted patients. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.</p><p><strong>Conclusions: </strong>The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80448"},"PeriodicalIF":2.0,"publicationDate":"2026-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12887564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin
Background: Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.
Objective: The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.
Methods: A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared with those of the rephrased LLM versions. A paired-samples Student t test was used to test whether readability scores were reduced, ideally to the eighth-grade level or below.
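The scoring and comparison can be sketched as follows, using commonly cited forms of the German Flesch adaptation (Amstad) and the 4th Wiener Sachtextformel. The syllable counter is a crude heuristic, and the example texts and score lists are placeholders, so the exact formulas and tooling used in the study may differ.

```python
# Sketch: German readability scores (Amstad FRE, 4th Wiener Sachtextformel) and a
# paired t test over original vs simplified texts. Placeholder inputs only.
import re
from scipy.stats import ttest_rel

def count_syllables(word):
    # Rough heuristic: count groups of (German) vowels.
    return max(1, len(re.findall(r"[aeiouäöüy]+", word.lower())))

def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    asl = len(words) / len(sentences)                       # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)
    ms = 100 * sum(count_syllables(w) >= 3 for w in words) / len(words)
    fre = 180 - asl - 58.5 * asw                            # Amstad FRE for German
    wstf4 = 0.2744 * ms + 0.2656 * asl - 1.693              # 4th Wiener Sachtextformel
    return fre, wstf4

orig = "Die Therapie chronischer Erkrankungen erfordert eine kontinuierliche Überwachung der Laborwerte."
simp = "Bei einer langen Krankheit muss man die Blutwerte oft kontrollieren."
print("original:", readability(orig), "simplified:", readability(simp))

# In the study, scores of all 60 originals were compared with their rephrased
# versions via a paired t test; illustrated here with made-up score lists.
wstf_original = [11.2, 12.0, 10.8, 11.9, 12.5]
wstf_rephrased = [9.1, 10.2, 9.5, 10.0, 10.9]
print(ttest_rel(wstf_original, wstf_rephrased))
```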
Results: Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). The LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot showed statistically significant improvements in readability. However, the t tests yielded no statistically significant results for the reduction of scores to or below the eighth-grade level.
Conclusions: LLMs can improve the readability of PEMs in German. This moderate improvement can support patients reading PEMs online. LLMs demonstrated their potential to make complex online medical text more accessible to a broader audience by increasing readability. This is the first study to evaluate this for German online medical texts.
{"title":"Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study.","authors":"Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin","doi":"10.2196/77149","DOIUrl":"10.2196/77149","url":null,"abstract":"<p><strong>Background: </strong>Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.</p><p><strong>Objective: </strong>The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.</p><p><strong>Methods: </strong>A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to the rephrased LLM versions. A Student t test for paired samples was used to test the reduction of readability scores, ideally to or lower than the eighth grade level.</p><p><strong>Results: </strong>Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). On average, the LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-40, and Microsoft Copilot showed a statistically significant improvement in readability. However, the t tests yielded no statistically significant results for the reduction of scores lower than the eighth grade level.</p><p><strong>Conclusions: </strong>LLMs can improve the readability of PEMs in German. This moderate improvement can support patients reading PEMs online. LLMs demonstrated their potential to make complex online medical text more accessible to a broader audience by increasing readability. This is the first study to evaluate this for German online medical texts.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e77149"},"PeriodicalIF":2.0,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829587/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompa Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib
Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHR), but this process is labor-intensive and prone to inter-rater variability. Large language models (LLMs) have demonstrated potential in automating text classification.
Objective: We aim to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.
Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) hospital discharge and (2) approximately 90 days post-discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS 0-2) versus non-independence (mRS 3-6). Four-fold cross-validation was conducted, using accuracy and Cohen's kappa as model performance metrics.
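The evaluation metrics can be illustrated with the short sketch below: accuracy and weighted Cohen's kappa on made-up multiclass mRS labels, followed by the binary independence mapping (mRS 0-2 vs 3-6). The quadratic weighting is one plausible choice and may differ from the study's.

```python
# Sketch: accuracy and (weighted) Cohen's kappa for mRS labels; made-up labels only.
from sklearn.metrics import accuracy_score, cohen_kappa_score

true_mrs = [0, 1, 2, 3, 4, 5, 6, 2, 3, 4]
pred_mrs = [0, 1, 3, 3, 4, 5, 6, 1, 3, 4]

print("multiclass accuracy:", accuracy_score(true_mrs, pred_mrs))
print("weighted kappa:", cohen_kappa_score(true_mrs, pred_mrs, weights="quadratic"))

# Collapse to functional independence (mRS 0-2) vs non-independence (mRS 3-6).
def to_binary(scores):
    return [int(s >= 3) for s in scores]

true_bin, pred_bin = to_binary(true_mrs), to_binary(pred_mrs)
print("binary accuracy:", accuracy_score(true_bin, pred_bin))
print("binary kappa:", cohen_kappa_score(true_bin, pred_bin))
```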
Results: A total of 2,290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, which considered all seven mRS scores, attained an accuracy of 77% and a weighted Cohen's kappa of 0.92. Class-specific accuracy was highest for mRS 4 (90%) and lowest for mRS 2 (28%). The binary model, which considered only functional independence versus non-independence, attained an accuracy of 92% and a Cohen's kappa of 0.84.
Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improved discrimination between intermediate scores is required.
{"title":"Assessment of the Modified Rankin Scale in Electronic Health Records with a Fine-tuned Large Language Model.","authors":"Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompa Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib","doi":"10.2196/82607","DOIUrl":"10.2196/82607","url":null,"abstract":"<p><strong>Background: </strong>The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHR), but this process is labor-intensive and prone to inter-rater variability. Large language models (LLMs) have demonstrated potential in automating text classification.</p><p><strong>Objective: </strong>We aim to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.</p><p><strong>Methods: </strong>We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) hospital discharge and (2) approximately 90 days post-discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS 0-2) versus non-independence (mRS 3-6). Four-fold cross-validation was conducted, using accuracy and Cohen's kappa as model performance metrics.</p><p><strong>Results: </strong>A total of 2,290 EHR passages with corresponding mRS scores were included in model training. The multiclass model-considering all seven scores of the mRS-attained an accuracy of 77% and a weighted Cohen's Kappa of 0.92. Class-specific accuracy was highest for mRS 4 (90%) and lowest for mRS 2 (28%). The binary model-considering only functional independence vs non-independence -attained an accuracy of 92% and Cohen's Kappa of 0.84. Conclusions.</p><p><strong>Conclusions: </strong>Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis, however, improving discrimination between intermediate scores is required.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek
Background: Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.
Objective: This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.
Methods: We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review labeled each encounter with gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the deterioration elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.
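A minimal sketch of two of the model families and the stacking ensemble, under synthetic data and with the LSTM omitted for brevity, is shown below; AUROC is computed as in the study's primary evaluation.

```python
# Sketch: elastic-net logistic regression + gradient boosting in a stacking ensemble,
# evaluated by AUROC. Synthetic data; not the study's features or tuning.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2480, 30))  # e.g., vitals, labs, demographics (placeholders)
y = (X[:, 0] - X[:, 1] + rng.normal(size=2480) > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [
    ("elastic_net", LogisticRegression(penalty="elasticnet", solver="saga",
                                       l1_ratio=0.5, C=1.0, max_iter=5000)),
    ("gbm", GradientBoostingClassifier(n_estimators=200)),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))

for name, model in base_models + [("stacking", stack)]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC={auc:.3f}")
```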
Results: Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012) while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the deterioration elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.
Conclusions: We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.
{"title":"Treatment Recommendations for Clinical Deterioration on the Wards: Development and Validation of Machine Learning Models.","authors":"Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek","doi":"10.2196/81642","DOIUrl":"10.2196/81642","url":null,"abstract":"<p><strong>Background: </strong>Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.</p><p><strong>Objective: </strong>This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.</p><p><strong>Methods: </strong>We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters), who were identified as high risk using a Food and Drug Administration-cleared early warning score (electronic Cardiac Arrest Risk Triage score). Manual chart review labeled each encounter with gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machines, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the deterioration elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.</p><p><strong>Results: </strong>Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012) while anticoagulants were the hardest to predict (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the deterioration elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.</p><p><strong>Conclusions: </strong>We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. 
Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81642"},"PeriodicalIF":2.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12810948/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Naveh Eskinazi, Moti Zwilling, Adilson Marques, Riki Tesler
Background: Advances in artificial intelligence (AI) have revolutionized digital wellness by providing innovative solutions for health, social connectivity, and overall well-being. Despite these advancements, the older population often struggles with barriers such as accessibility, digital literacy, and infrastructure limitations, leaving them at risk of digital exclusion. These challenges underscore the critical need for tailored AI-driven interventions to bridge the digital divide and enhance the inclusion of older adults in the digital ecosystem.
Objective: This study presents a comparative bibliometric analysis of research on the role of AI in promoting digital wellness, with a particular emphasis on the older population in comparison to the general population. The analysis addressed five key research topics: (1) the evolution of AI's impact on digital wellness over time for both the older and general population, (2) patterns of collaboration globally, (3) leading institutions' contribution to AI-focused research, (4) prominent journals in the field, and (5) emerging themes and trends in AI-related research.
Methods: Data were collected from the Web of Science between 2016 and 2025, totaling 3429 documents (344 related to older people), analyzed using bibliometric tools.
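As a simple illustration of the descriptive bibliometric step, the sketch below tallies annual publication counts for an older-population subset versus the rest, using a tiny in-memory stand-in for a Web of Science export; the column names and keyword filter are hypothetical, not the study's actual query.

```python
# Sketch: annual publication counts by cohort from a stand-in bibliographic export.
import pandas as pd

records = pd.DataFrame({
    "year": [2016, 2016, 2019, 2022, 2022, 2024, 2024, 2025],
    "keywords": ["digital health", "machine learning", "older adults; dementia",
                 "telemedicine", "mobile health; elderly", "machine learning",
                 "risk management; aged", "digital health"],
})

older_terms = ("older adult", "elderly", "aged", "geriatric", "dementia")
is_older = records["keywords"].str.lower().apply(
    lambda kw: any(term in kw for term in older_terms))

trend = (records.assign(cohort=is_older.map({True: "older population",
                                             False: "general population"}))
                .groupby(["year", "cohort"]).size().unstack(fill_value=0))
print(trend)
```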
Results: Results indicate that AI-related digital wellness research for the general population has experienced exponential growth since 2016, with significant contributions from the United States, the United Kingdom, and China. In contrast, research on older people has seen slower growth, with more localized collaboration networks and a steady increase in citations. Key research topics for the general population include digital health, machine learning, and telemedicine, whereas studies on older people focus on dementia, mobile health, and risk management.
Conclusions: Our analysis highlights a growing body of research on AI-driven solutions intended to improve digital wellness among older adults and identifies future research directions that address the specific needs of this population segment.
{"title":"The Role of AI in Improving Digital Wellness Among Older Adults: Comparative Bibliometric Analysis.","authors":"Naveh Eskinazi, Moti Zwilling, Adilson Marques, Riki Tesler","doi":"10.2196/71248","DOIUrl":"10.2196/71248","url":null,"abstract":"<p><strong>Background: </strong>Advances in artificial intelligence (AI) have revolutionized digital wellness by providing innovative solutions for health, social connectivity, and overall well-being. Despite these advancements, the older population often struggles with barriers such as accessibility, digital literacy, and infrastructure limitations, leaving them at risk of digital exclusion. These challenges underscore the critical need for tailored AI-driven interventions to bridge the digital divide and enhance the inclusion of older adults in the digital ecosystem.</p><p><strong>Objective: </strong>This study presents a comparative bibliometric analysis of research on the role of AI in promoting digital wellness, with a particular emphasis on the older population in comparison to the general population. The analysis addressed five key research topics: (1) the evolution of AI's impact on digital wellness over time for both the older and general population, (2) patterns of collaboration globally, (3) leading institutions' contribution to AI-focused research, (4) prominent journals in the field, and (5) emerging themes and trends in AI-related research.</p><p><strong>Methods: </strong>Data were collected from the Web of Science between 2016 and 2025, totaling 3429 documents (344 related to older people), analyzed using bibliometric tools.</p><p><strong>Results: </strong>Results indicate that AI-related digital wellness research for the general population has experienced exponential growth since 2016, with significant contributions from the United States, the United Kingdom, and China. In contrast, research on older people has seen slower growth, with more localized collaboration networks and a steady increase in citations. Key research topics for the general population include digital health, machine learning, and telemedicine, whereas studies on older people focus on dementia, mobile health, and risk management.</p><p><strong>Conclusions: </strong>The results of our analysis highlight an increasing body of research focused on AI-driven solutions intended to improve the digital wellness among older people and identify future research directions to refer to the specific needs of this population segment.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e71248"},"PeriodicalIF":2.0,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12808873/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145992164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}