
Latest publications from JMIR AI

Exploring Clinician Perspectives on Artificial Intelligence in Primary Care: Qualitative Systematic Review and Meta-Synthesis.
IF 2 Pub Date: 2026-02-05 DOI: 10.2196/72210
Robin Bogdanffy, Alisa Mundzic, Peter Nymberg, David Sundemo, Anna Moberg, Carl Wikberg, Ronny Kent Gunnarsson, Jonathan Widén, Pär-Daniel Sundvall, Artin Entezarjou
Background: Recent advances have highlighted the potential of artificial intelligence (AI) systems to assist clinicians with administrative and clinical tasks, but concerns regarding biases, lack of regulation, and potential technical issues pose significant challenges. The lack of a clear definition of AI, combined with limited qualitative research exploring clinicians' perspectives, has constrained the understanding of perspectives on AI in primary health care settings.

Objective: This review aims to synthesize current qualitative research on the perspectives of clinicians on AI in primary care settings.

Methods: A systematic search was conducted in the MEDLINE (PubMed), Scopus, Web of Science, and CINAHL (EBSCOhost) databases for publications from inception to February 5, 2024. The search strategy was designed using the Sample, Phenomenon of Interest, Design, Evaluation, and Research type (SPIDER) framework. Studies were eligible if they were published in English, peer-reviewed, and provided qualitative analyses of clinician perspectives on AI in primary health care. Studies were excluded if they were gray literature; used questionnaires, surveys, or similar methods for data collection; or if the perspectives of clinicians were not distinguishable from those of nonclinicians. A qualitative systematic review and thematic synthesis were performed. The Grading of Recommendations Assessment, Development and Evaluation-Confidence in Evidence from Reviews of Qualitative Research (GRADE-CERQual) approach was used to assess confidence in the findings. The CASP (Critical Appraisal Skills Program) checklist for qualitative research was used for risk-of-bias and quality appraisal.

Results: A total of 1492 records were identified, of which 13 studies from 6 countries were included, representing qualitative data from 238 primary care physicians, nurses, physiotherapists, and other health care professionals providing direct patient care. Eight descriptive themes were identified and synthesized into 3 analytical themes using thematic synthesis: (1) the human-machine relationship, describing clinicians' thoughts on AI assistance in administration and clinical work, interactions between clinicians, patients, and AI, and resistance and skepticism toward AI; (2) the technologically enhanced clinic, highlighting the effects of AI on the workplace, fear of errors, and desired features; and (3) the societal impact of AI, reflecting concerns about data privacy, medicolegal liability, and bias. GRADE-CERQual assessment rated confidence as high in 15 findings, moderate in 5 findings, and low in 1 finding.

Conclusions: Clinicians view AI as a technology that can both enhance and complicate primary health care. While AI can provide substantial support, its integration into health care requires careful consideration of ethical implications, technical reliability, and the maintenance of human oversight. Interpretation is limited by the heterogeneity of qualitative methods and the diversity of AI technologies examined in the included studies. More in-depth qualitative research on the impact of AI on clinicians' profession and autonomy may inform the future development of AI systems.
Citations: 0
Human-Generative AI Interactions and Their Effects on Beliefs About Health Issues: Content Analysis and Experiment.
IF 2 Pub Date: 2026-02-04 DOI: 10.2196/80270
Linqi Lu, Yanshu Sybil Wang, Jiawei Liu, Douglas M McLeod
Citations: 0
Explainable AI Approaches in Federated Learning: Systematic Review.
IF 2 Pub Date: 2026-02-03 DOI: 10.2196/69985
Titus Tunduny, Bernard Shibwabo

Background: Artificial intelligence (AI) has, in the recent past, experienced a rebirth with the growth of generative AI systems such as ChatGPT and Bard. These systems are trained with billions of parameters and have enabled widespread accessibility and understanding of AI among different user groups. Widespread adoption of AI has led to the need for understanding how machine learning (ML) models operate to build trust in them. An understanding of how these models generate their results remains a huge challenge that explainable AI seeks to solve. Federated learning (FL) grew out of the need to have privacy-preserving AI by having ML models that are decentralized but still share model parameters with a global model.
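
The FL setup described above, where decentralized models share only parameters with a global model, can be sketched as a minimal federated-averaging (FedAvg) round. Everything below is an illustrative toy (a linear model, 3 synthetic clients, arbitrary hyperparameters), not taken from any reviewed study.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local gradient-descent steps; raw data never leaves the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient for a linear model
        w -= lr * grad
    return w

def fedavg_round(global_w, client_data):
    """Server averages client parameter updates, weighted by local sample count."""
    updates, sizes = [], []
    for X, y in client_data:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.asarray(sizes, dtype=float))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three clients with private local datasets
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):  # communication rounds between server and clients
    w = fedavg_round(w, clients)
print(np.round(w, 2))
```

Only the parameter vector `w` crosses the network in each round, which is the privacy argument the review's background makes for FL.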

Objective: This study sought to examine the extent of development of the explainable AI field within the FL environment in relation to the main contributions made, the types of FL, the sectors it is applied to, the models used, the methods applied by each study, and the databases from which sources are obtained.

Methods: A systematic search was undertaken in 8 electronic databases: Web of Science Core Collection, Scopus, PubMed, ACM Digital Library, IEEE Xplore, Mendeley, BASE, and Google Scholar.

Results: A review of 26 studies revealed that research on explainable FL is steadily growing, despite being concentrated in Europe and Asia. The key determinants of FL use were data privacy and limited training data. Horizontal FL remains the preferred approach to federated ML, and post hoc techniques dominate explainability.

Conclusions: There is potential for development of novel approaches and improvement of existing approaches in the explainable FL field, especially for critical areas.

Trial registration: OSF Registries 10.17605/OSF.IO/Y85WA; https://osf.io/y85wa.

Citations: 0
Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.
IF 2 Pub Date: 2026-02-03 DOI: 10.2196/67717
Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka

Background: Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.

Objective: This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.

Methods: This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.
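
The multivariable logistic regression described above can be sketched as follows. This is not the study's code or data: it fits a Newton-Raphson logistic regression on synthetic data with a planted effect and reads the adjusted odds ratio off as exp(coefficient), the quantity the study reports for message humanness.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson maximum-likelihood fit; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])     # observed information matrix
        beta += np.linalg.solve(H, X.T @ (y - p))  # Newton step on the score
    return beta

rng = np.random.default_rng(1)
n = 2000
humanness = rng.normal(size=n)  # standardized predictor of interest (toy)
age = rng.normal(size=n)        # one covariate standing in for the adjustments
# Planted model: log-odds = -0.5 + log(2.37)*humanness + 0.3*age
logit = -0.5 + np.log(2.37) * humanness + 0.3 * age
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = np.column_stack([np.ones(n), humanness, age])
beta = fit_logistic(X, y)
print(f"adjusted OR for humanness ~ {np.exp(beta[1]):.2f}")
```

Exponentiating a fitted coefficient while holding the other columns in `X` fixed is exactly what "adjusted odds ratio" means in the Results paragraph below.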

Results: Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).

Conclusions: To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.

Citations: 0
Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.
IF 2 Pub Date: 2026-01-30 DOI: 10.2196/76928
Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham

Background: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.

Objective: To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120-question set, assessing their accuracy and consistency across question types and medical subjects.

Methods: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
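
The pairwise testing described above can be sketched as below. The correct-answer counts are reconstructed from the percentages reported in the Results (e.g., 91.6% of 119 ≈ 109), and treating the models' answer sets as independent samples only approximates a per-question paired comparison, so this is an illustration of the Fisher-plus-Bonferroni procedure, not a reproduction of the study's analysis.

```python
from itertools import combinations
from scipy.stats import fisher_exact

# Correct/incorrect counts reconstructed from the reported accuracies (assumed).
correct = {"Grok": 109, "Copilot": 101, "Gemini": 100, "ChatGPT-4": 95, "DeepSeek": 86}
total = 119

pairs = list(combinations(correct, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 10 pairwise comparisons among 5 models

for a, b in pairs:
    # 2x2 contingency table: rows = models, columns = correct / incorrect
    table = [[correct[a], total - correct[a]],
             [correct[b], total - correct[b]]]
    _, p = fisher_exact(table)
    flag = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({flag} at Bonferroni-adjusted alpha)")
```

Dividing the 0.05 threshold by the number of comparisons is the Bonferroni adjustment the Methods paragraph mentions; a chi-square test via `scipy.stats.chi2_contingency` would slot into the same loop.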

Results: Grok achieved the highest overall score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.

Conclusions: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.


Citations: 0
Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.
IF 2 Pub Date: 2026-01-29 DOI: 10.2196/77988
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
Background: Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.

Objective: This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).

Methods: We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.

Results: The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% confidence interval [CI] 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, defined as misclassifications that could significantly influence subsequent clinical decisions: our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies acceptable for clinical workflows.

Conclusions: The proposed hybrid framework, integrating structured prompt engineering with SFT applied to the reasoning-heavy T and N tasks, establishes GLM-4-Air as a highly accurate, clinically reliable, and cost-effective solution for automated NSCLC TNM staging. This work demonstrates the effectiveness and potential of domain-optimized smaller models relative to off-the-shelf generalist models, promising improved diagnostic standardization in resource-aware health care settings.
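
The LoRA idea behind the parameter-efficient SFT step can be illustrated in a few lines of numpy: the pretrained weight stays frozen while two small low-rank matrices carry all the trainable parameters. The dimensions, rank, and scaling below are toy values, not the study's GLM-4-Air configuration.

```python
import numpy as np

d_out, d_in, r, alpha = 1024, 1024, 8, 16  # toy layer size, rank, and scale
rng = np.random.default_rng(42)

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    """Forward pass: frozen weight plus the scaled low-rank update B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)  # equals W @ x at initialization, since B is all zeros

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tune {full} "
      f"({100 * lora / full:.1f}%)")
```

Because `B` starts at zero, fine-tuning begins exactly at the pretrained model's behavior, and only the roughly 1.6% of parameters in `A` and `B` are updated, which is what makes this kind of SFT feasible on modest hardware.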
背景:准确的TNM分期是非小细胞肺癌(NSCLC)治疗计划和预后的基础。然而,其复杂性带来了重大挑战,特别是在不同临床环境的标准化解释方面。传统的基于规则的自然语言处理方法受其依赖于人工制定的规则的限制,并且容易受到临床报告不一致的影响。目的:本研究旨在通过先进的即时工程和监督微调(SFT),战略性地增强大型语言模型GLM-4-Air,开发和验证一个强大、准确、高效的NSCLC TNM分期人工智能框架。方法:我们构建了一个精心整理的数据集,包含492份去识别的真实世界医学影像报告,并根据AJCC(美国癌症联合委员会)第8版指南,由资深医生严格验证TNM分期注释。通过多阶段过程对GLM-4-Air模型进行了系统优化:针对所有阶段任务采用思维链推理和领域知识注入的迭代提示工程,然后针对推理密集型的T和N阶段任务采用低秩自适应(Low-Rank Adaptation, LoRA)的参数高效SFT。最终的混合模型在一个完全固定的内部测试集(黑盒)上进行评估,并使用标准指标、统计测试和分期错误的临床影响分析对gpt - 40进行基准测试。结果:优化后的混合GLM-4-Air模型性能可靠。它在黑盒测试集上获得了更高的分期准确性:T为92%(95%置信区间(CI): 0.850-0.959), N为86% (95% CI: 0.779-0.915), M为92% (95% CI: 0.850-0.959),总体临床分期为90%;相比之下,gpt - 40分别达到87% (95% CI: 0.790-0.922)、70% (95% CI: 0.604-0.781)、78% (95% CI: 0.689-0.850)和80%。宏观平均f1得分分别为0.914 (T)、0.815 (N)和0.831 (M),持续优于gpt - 40(0.836、0.620和0.698),进一步证明了模型的稳健性。对混淆矩阵的分析证实了该模型在识别关键分期特征方面的熟练程度,同时有效地减少了假阴性。至关重要的是,临床影响评估显示严重的I类错误大幅减少,这被定义为可能显著影响后续临床决策的错误分类。我们的模型在两个测试集的M阶段中犯了0个第一类错误,在T和N阶段犯了更少的第一类错误。此外,该框架展示了实际的可部署性,实现了对消费级硬件(例如,4个RTX 4090 gpu)的有效推断,延迟适合临床工作流程并可接受。结论:所提出的混合框架,整合了结构化提示工程,并将SFT应用于推理繁重的任务(T/N),使GLM-4-Air模型成为一种高度准确、临床可靠且经济高效的NSCLC TNM自动分期解决方案。这项工作证明了与现成的通才模型相比,领域优化的小型模型的有效性和潜力,有望在资源感知型医疗保健环境中增强诊断标准化。临床试验:
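The reported 95% CIs are consistent with Wilson score intervals on a test set of about 100 reports (92/100 correct yields exactly the stated 0.850-0.959 interval); the sample size of 100 is inferred here, not stated in the abstract. A minimal sketch of how such an interval is computed:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Illustration: 92 correct T stagings out of an assumed 100 test reports
lo, hi = wilson_ci(92, 100)
print(f"92% accuracy, 95% CI: {lo:.3f}-{hi:.3f}")  # 0.850-0.959
```

Unlike the simpler Wald interval, the Wilson interval stays inside [0, 1] and behaves sensibly for accuracies near 100%, which is why it is common for reporting staging accuracy on small test sets.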
Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.
IF 2 Pub Date : 2026-01-29 DOI: 10.2196/77988
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
Citations: 0
Large Language Model-based Chatbots and Agentic AI for Mental Health Counseling: A Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards.
IF 2 Pub Date : 2026-01-27 DOI: 10.2196/80348
Ha Na Cho, Kai Zheng, Jiayuan Wang, Di Hu
<p><strong>Background: </strong>Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.</p><p><strong>Objective: </strong>This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.</p><p><strong>Methods: </strong>We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.</p><p><strong>Results: </strong>Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuned or domain-adaptation using models such as LlaMa, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). 
Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.</p><p><strong>Conclusions: </strong>LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. Future work should prioritize clinically grounded evaluation frameworks, transparent model reporting and prompt configuration, and stronger validation using standardized outcomes to support safe, reliable, and regulation-ready deployment.</p>
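Of the quantitative language metrics the review tallies (BLEU, ROUGE, perplexity), perplexity is the simplest to illustrate: it is the exponentiated mean negative log-probability a model assigns to the reference tokens. A minimal sketch with made-up token probabilities:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Exponential of the mean negative log-probability over tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every reference token
# is exactly as "surprised" as a uniform 4-way guess: perplexity 4
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))
```

Lower perplexity means the model found the reference text more predictable; like BLEU and ROUGE, it measures surface language fit rather than clinical appropriateness, which is why the review pairs such metrics with qualitative assessment.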
Citations: 0
Accelerating Discovery of Leukemia Inhibitors Using AI-Driven Quantitative Structure-Activity Relationship: Algorithm Development and Validation.
IF 2 Pub Date : 2026-01-27 DOI: 10.2196/81552
Samuel Kakraba, Edmund Fosu Agyemang, Robert J Shmookler Reis
<p><strong>Background: </strong>Leukemia treatment remains a major challenge in oncology. While thiadiazolidinone analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to efficiently explore the vast chemical landscape, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.</p><p><strong>Objective: </strong>We aimed to develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and prediction of thiadiazolidinone analogs with improved antileukemia activity by systematically evaluating molecular descriptors and algorithmic approaches to identify key determinants of potency and guide future inhibitor optimization.</p><p><strong>Methods: </strong>We analyzed 35 thiadiazolidinone derivatives with confirmed antileukemia activity, removing outliers for data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D-4D). Seventeen ML models, including random forests, XGBoost, and neural networks, were trained on 70% of the data and tested on 30%, using stratified random sampling. Model performance was assessed with 12 metrics, including mean squared error (MSE), coefficient of determination (explained variance; R<sup>2</sup>), and Shapley additive explanations (SHAP) values, and optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison to baseline linear models, and cross-validation stability analysis, were performed to assess genuine learning rather than overfitting.</p><p><strong>Results: </strong>Isotonic regression ranked first with the lowest test MSE (0.00031 ± 0.00009), outperforming baseline models by over 15% in explained variance. 
Ensemble methods, especially LightGBM and random forest, also showed superior predictive performance (LightGBM: MSE=0.00063 ± 0.00012; R<sup>2</sup>=0.9709 ± 0.0084). Training-to-test performance degradation of LightGBM was modest (ΔR<sup>2</sup>=-0.01, ΔMSE=+0.000126), suggesting genuine pattern learning rather than memorization. SHAP analysis revealed that the most influential features contributing to antileukemia activity were global molecular shape (r_qp_glob; mean SHAP value=0.52), weighted polar surface area (r_qp_WPSA; ≈0.50), polarizability (r_qp_QPpolrz; ≈0.49), partition coefficient (r_qp_QPlogPC16; ≈0.48), solvent-accessible surface area (r_qp_SASA; ≈0.48), hydrogen bond donor count (r_qp_donorHB; ≈0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ≈0.47). These features highlight the importance of steric complementarity and the 3D arrangement of functional groups. Aqueous solubility (r_qp_QPlogS; ≈0.47) and hydrogen bond acceptor count (r_qp_accptHB; ≈0.44) were also among the top 10 features. The importance of these descriptors was consistent across multiple algorithmic models, including random forest, XGBoost, and partial least squares (PLS) approaches.</p><p><strong>Conclusions: </strong>Integrating advanced ML with QSAR modeling enabled a systematic analysis of the structure-activity relationships of thiadiazolidinone analogs in this dataset. Although ensemble methods captured complex patterns with high internal validation metrics, external validation on independent compounds and prospective experimental testing are essential before broad therapeutic claims can be made. This work provides a methodological foundation for future validation efforts and identifies key molecular features.</p>
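The two headline regression metrics in this abstract, MSE and R<sup>2</sup> (the fraction of variance in activity explained by the model), have standard definitions; a minimal sketch with toy numbers, not the study's data:

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error between observed and predicted activities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy example: four compounds, predictions close to observed activity
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(mse(y_true, y_pred), 3), round(r_squared(y_true, y_pred), 2))
```

With only 35 compounds, small-MSE/high-R<sup>2</sup> results on an internal split are exactly the situation where the abstract's train-test gap and cross-validation stability checks matter, since a flexible model can otherwise score well by memorization.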
Citations: 0
Evaluating an AI Decision Support System for the Emergency Department: Retrospective Study.
IF 2 Pub Date : 2026-01-26 DOI: 10.2196/80448
Yvette Van Der Haas, Wiesje Roskamp, Lidwina Elisabeth Maria Chang-Willems, Boudewijn van Dongen, Swetta Jansen, Annemarie de Jong, Renata Medeiros de Carvalho, Dorien Melman, Arjan van de Merwe, Marieke Bastian-Sanders, Bart Overbeek, Rogier Leendert Charles Plas, Marleen Vreeburg, Thomas van Dijk

Background: Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.

Objective: This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.

Methods: A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting AI model was developed and tested on data from 154,347 ED visits to predict admission decisions. The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome measured was the reduction in the decision-making time between the AI model and the admission decision made by the clinician. Secondary outcomes analyzed the performance of the model across various subgroups, including the age of the patient, medical specialty, classification category, and time of day.

Results: The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for true positive predicted patients. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.

Conclusions: The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.
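The reported precision of 0.78 and recall of 0.73 follow the standard confusion-matrix definitions; the sketch below uses hypothetical counts chosen to reproduce those values, since the abstract does not give the underlying confusion matrix.

```python
def precision(tp: int, fp: int) -> float:
    """Of all visits the model predicted as admissions, fraction actually admitted."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all visits that ended in admission, fraction the model caught."""
    return tp / (tp + fn)

# Hypothetical counts, scaled down for illustration only
print(precision(tp=78, fp=22))  # 0.78
print(recall(tp=73, fn=27))     # 0.73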

Citations: 0
Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study.
IF 2 Pub Date : 2026-01-23 DOI: 10.2196/77149
Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin

Background: Patient education materials (PEMs) found online are often written at a complexity level too high for the average reader, which can hinder understanding and informed decision-making. Large language models (LLMs) may offer a solution by simplifying complex medical texts. To date, little is known about how well LLMs can handle simplification tasks for German-language PEMs.

Objective: The study aims to investigate whether LLMs can increase the readability of German online medical texts to a recommended level.

Methods: A sample of 60 German texts originating from online medical resources was compiled. To improve the readability of these texts, four LLMs were selected and used for text simplification: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, and Le Chat. Next, readability scores (Flesch reading ease [FRE] and Wiener Sachtextformel [4th Vienna Formula; WSTF]) of the original texts were computed and compared to the rephrased LLM versions. A Student t test for paired samples was used to test the reduction of readability scores, ideally to or lower than the eighth grade level.

Results: Most of the original texts were rated as difficult to quite difficult (average WSTF 11.24, SD 1.29; FRE 35.92, SD 7.64). On average, the LLMs achieved the following average scores: ChatGPT-3.5 (WSTF 9.96, SD 1.52; FRE 45.04, SD 8.62), ChatGPT-4o (WSTF 10.6, SD 1.37; FRE 39.23, SD 7.45), Microsoft Copilot (WSTF 8.99, SD 1.10; FRE 49.0, SD 6.51), and Le Chat (WSTF 11.71, SD 1.47; FRE 33.72, SD 8.58). ChatGPT-3.5, ChatGPT-4o, and Microsoft Copilot showed a statistically significant improvement in readability. However, the t tests yielded no statistically significant results for the reduction of scores lower than the eighth grade level.

Conclusions: LLMs can improve the readability of PEMs in German. This moderate improvement can support patients reading PEMs online. LLMs demonstrated their potential to make complex online medical text more accessible to a broader audience by increasing readability. This is the first study to evaluate this for German online medical texts.
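Both readability measures are closed-form formulas over simple text statistics. The sketch below assumes Amstad's German adaptation of Flesch reading ease and the published constants of the 4th Wiener Sachtextformel; the abstract does not state which Flesch variant the authors computed, so the exact coefficients here are an assumption, and the input statistics are toy values.

```python
def flesch_reading_ease_de(words_per_sentence: float, syllables_per_word: float) -> float:
    # Assumed: Amstad's German adaptation of Flesch reading ease
    # (higher score = easier text; ~35 corresponds to "difficult")
    return 180 - words_per_sentence - 58.5 * syllables_per_word

def wstf4(pct_three_plus_syllables: float, words_per_sentence: float) -> float:
    # Assumed constants of the 4th Wiener Sachtextformel;
    # the result approximates a German school grade level
    return 0.2744 * pct_three_plus_syllables + 0.2656 * words_per_sentence - 1.693

# Toy statistics for a moderately complex German text
print(round(flesch_reading_ease_de(18.0, 2.0), 1))  # 45.0
print(round(wstf4(30.0, 15.0), 2))                  # 10.52
```

The two scales run in opposite directions (FRE falls and WSTF rises as text gets harder), which is why the study's target of "eighth grade level" maps to raising FRE while lowering WSTF toward 8.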

Leveraging Large Language Models to Improve the Readability of German Online Medical Texts: Evaluation Study.
Amela Miftaroski, Richard Zowalla, Martin Wiesner, Monika Pobiruchin
IF 2 Pub Date : 2026-01-23 DOI: 10.2196/77149
JMIR AI, vol 5, e77149. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829587/pdf/