
Latest Publications in JMIR AI

A Pragmatic Framework for Federated Learning Risk and Governance in Academic Medical Centers.
IF 2 Pub Date : 2026-02-27 DOI: 10.2196/80022
Daniel Bottomly, Bridget Barnes, Kuli Mavuwa, Nikki Lee, Holger R Roth, Chester Chen, Shannon K McWeeney

With the rapid development of artificial intelligence (AI), particularly large language models, there is growing interest in adopting AI approaches within academic medical centers (AMCs). However, the vast amounts of data required for AI and the sensitive nature of medical information pose significant challenges to developing high-performing models at individual institutions. Furthermore, recent changes in government funding priorities may result in the decentralization of biomedical data repositories, creating significant barriers to effective data sharing and robust model development. This has generated significant interest in federated learning (FL), which enables collaborative model training without transferring data between institutions, thereby enhancing the protection of proprietary and sensitive information. While FL offers a crucial pathway to multi-institutional AI development that maintains data privacy, it also exposes AMCs to novel governance, security, and operational risks that are not fully addressed by existing procedures. In response, this manuscript provides a perspective grounded both in leading international standards (the National Institute of Standards and Technology AI Risk Management Framework [NIST AI RMF] and ISO/IEC 42001, from the International Organization for Standardization and the International Electrotechnical Commission) and in the real-world governance experience of AMC leadership. We present a risk differentiation framework, an FL risk matrix, and a set of essential governance artifacts, each mapped to key institutional challenges and reviewed for alignment with core standards, but offered as pragmatic, illustrative guides rather than prescriptive checklists. Together, these tools give AMC security, privacy, and governance leaders a novel, standards-informed, context-sensitive resource for addressing the evolving risks of FL in biomedical research and clinical environments.

Citations: 0
Facial Expression-Based Evaluation of the Emotion Estimation Software Kokoro Sensor in Healthy Individuals: Validation and Reliability Pilot Study.
IF 2 Pub Date : 2026-02-26 DOI: 10.2196/81868
Shota Yoshihara, Satoru Amano, Kayoko Takahashi
Background: In recent years, artificial intelligence (AI) systems have increasingly been used to assess emotional states in health care. AI offers a safe, quick, user-friendly, and objective emotional evaluation method. However, evidence supporting its implementation in health care remains limited.

Objective: This study aimed to explore the concurrent validity and test-retest reliability of emotion recognition AI based on facial expressions.

Methods: In this study, we used the Kokoro Sensor, an accurate and widely recognized automated facial expression recognition system. The Japanese version of the Profile of Mood States-Short Form was used to screen for the potential influence of mental states on facial expressions. The study participants made positive, negative, and neutral expressions, which were analyzed by the emotion recognition AI. Agreement between the AI results and subjective evaluations was assessed by participants and a researcher using a 4-point Likert-type scale. The facial expression and emotion analysis process was repeated after a 30-minute interval to investigate reliability. Concurrent validity was evaluated using the content validity index (CVI) and the κ coefficient, and test-retest reliability was determined using the κ coefficient.

Results: The study participants were 40 individuals whose mental states did not deviate from the reference range of the Profile of Mood States manual. Among the participants, the CVI values for positive, neutral, and negative expressions were 95%, 98%, and 85%, respectively. Among the researchers, the corresponding CVI values were 100%, 100%, and 70%, respectively. The overall weighted κ coefficient was 0.55 (CI 0.44-0.67), indicating moderate agreement. Agreement was almost perfect for distinguishing positive from neutral expressions (κ=0.83, 95% CI 0.70-0.95) but not statistically significant for distinguishing negative from neutral expressions (κ=0.15, 95% CI -0.07 to 0.37). Test-retest reliability analysis showed an overall weighted κ coefficient of 0.66, reflecting substantial reliability. Almost perfect agreement was observed for distinguishing positive from neutral expressions (κ=0.85, 95% CI 0.73-0.97), while distinguishing negative from neutral expressions showed limited reliability (κ=0.36, 95% CI 0.16-0.57).

Conclusions: Our findings suggest that the Kokoro Sensor may be useful for identifying positive affect, given its acceptable concurrent validity for overall valence estimation and its high agreement for distinguishing positive from neutral expressions. However, concurrent validity for negative expressions did not meet the prespecified benchmark based on the researcher's ratings, and agreement for distinguishing negative from neutral expressions was limited, which may constrain clinical utility for detecting negative affect. Therefore, in clinical settings, the Kokoro Sensor should be used as an adjunctive tool rather than a standalone method.
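The agreement figures above are weighted Cohen κ coefficients. As a minimal illustration of how such a statistic works (a generic sketch, not the authors' analysis code; the function name and example labels are invented here), a linearly weighted κ for two raters over ordered categories can be computed as:

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over an ordered category list."""
    idx = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    # Observed co-rating counts: obs[i][j] = times rater A said i while B said j
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1
    # Marginal totals give the chance-expected counts
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    exp = [[row[i] * col[j] / n for j in range(k)] for i in range(k)]
    # Disagreement weight grows with distance between ordinal categories
    def w(i, j):
        d = abs(i - j)
        return d if weights == "linear" else d * d
    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * exp[i][j] for i in range(k) for j in range(k))
    return 1.0 - num / den
```

Because the weights penalize disagreements in proportion to their distance on the ordinal scale, confusions between adjacent categories (e.g., neutral vs negative) cost less than confusions across the whole scale, which is why an overall weighted κ can look moderate even when one pairwise contrast is weak.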
Citations: 0
Assessment of the Modified Rankin Scale in Electronic Health Records With a Fine-Tuned Large Language Model: Development and Internal Validation.
IF 2 Pub Date : 2026-02-25 DOI: 10.2196/82607
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompal Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib

Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHRs), but this process is labor-intensive and prone to interrater variability. Large language models (LLMs) have demonstrated potential in automating text classification.

Objective: We aimed to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.

Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) at hospital discharge and (2) approximately 90 days post discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS scores 0-2) versus non-independence (mRS scores 3-6). Four-fold cross-validation was conducted using accuracy and the Cohen κ as model performance metrics.

Results: A total of 2290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, which considered all seven mRS scores, attained an accuracy of 77% and a weighted Cohen κ of 0.92. Class-specific accuracy was highest for mRS score 4 (90%) and lowest for mRS score 2 (28%). The binary model, which considered only functional independence versus non-independence, attained an accuracy of 92% and a Cohen κ of 0.84.

Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improving discrimination between intermediate scores is required.
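The binary model's label space is just a dichotomization of the seven-point scale described in the Methods. A minimal sketch of that mapping plus a plain accuracy metric (illustrative only; the function names are invented here, and the study's actual pipeline fine-tuned an LLM on EHR text):

```python
def mrs_to_binary(score):
    """Collapse an mRS score (0-6) into the binary outcome used by the second model."""
    if score not in range(7):  # valid mRS scores are the integers 0 through 6
        raise ValueError(f"invalid mRS score: {score}")
    return "independent" if score <= 2 else "non-independent"

def accuracy(predicted, actual):
    """Fraction of exact matches between predicted and reference labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Collapsing seven ordinal classes into two is one reason the binary model's accuracy (92%) exceeds the multiclass model's (77%): errors between adjacent scores on the same side of the 2/3 boundary disappear under the coarser labels.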

Citations: 0
Explaining the Slow Adoption of AI Innovations in Health Care: Network Analysis Approach.
IF 2 Pub Date : 2026-02-23 DOI: 10.2196/60458
Petra Apell, Sara Locher, Annie Milde, Henrik Eriksson

Background: Artificial intelligence (AI) is a topic of considerable hype, with many actors sensing its high potential for health care applications. Despite this, adoption has been slow, and few applications have been implemented in clinical practice.

Objective: The aim of our study was to investigate the challenges associated with using AI in health care, as well as to provide suggestions for how further adoption of AI within health care organizations can be facilitated.

Methods: A qualitative case study with a mixed methods approach was conducted at one of Sweden's largest hospitals. Regulatory approved AI medical devices were analyzed, and primary qualitative data from 14 expert interviews were collected and cross-referenced with secondary quantitative data. The framework of technological innovation systems was used to analyze the system factors and their dynamics to identify blocking mechanisms and areas for improvement.

Results: Addressing the challenges related to knowledge development, diffusion, legitimation, and resource mobilization could trigger a cascade of positive activities, thereby significantly enhancing the overall performance of the innovation system. Creating dedicated testing environments to evaluate safety and efficacy would facilitate routine clinical use and reinforce the adoption of AI innovations in health care organizations.

Conclusions: This analysis shows that the adoption of AI health care technology innovations can be accelerated through targeted strategies and supportive mechanisms triggering virtuous cycles that facilitate clinical validation and generate compelling use cases. The interconnection between guidance of search and entrepreneurial experimentation has been confirmed, providing the initial conditions for knowledge development, diffusion, and legitimation in the early stages of emerging technologies.

Citations: 0
Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation.
IF 2 Pub Date : 2026-02-20 DOI: 10.2196/83640
Saubhagya Joshi, Monjil Mehta, Sarjak Maniar, Mengqian Wang, Vivek Kumar Singh

Background: Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.

Objective: This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.

Methods: We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.

Results: Contrary to expectations, LLMs demonstrated notable robustness to common variations: in more than half of the cases (151/270, 55.92%), performance was stable or improved. In some cases (38/270, 14.07%), variations resulted in increased performance, especially at lower perturbation levels. Redactions, which often stem from privacy concerns or cognitive lapses, were more detrimental than the other variations.

Conclusions: Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.
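The three perturbation types named in the Methods (redaction, homophones, typographical errors) applied at varying levels can be mimicked with a small token-level generator. The sketch below is illustrative only: the homophone table, function name, and sampling strategy are assumptions, not the published dataset-construction code.

```python
import random

# Toy homophone table for illustration; the study's actual lexicon is not reproduced here.
HOMOPHONES = {"weak": "week", "pain": "pane", "heal": "heel"}

def perturb(text, kind, level, seed=0):
    """Apply one type of human-like noise to roughly `level` of the tokens."""
    rng = random.Random(seed)
    tokens = text.split()
    n_hit = max(1, round(level * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_hit):
        t = tokens[i]
        if kind == "redaction":
            tokens[i] = "[REDACTED]"          # e.g., privacy-motivated omission
        elif kind == "homophone":
            tokens[i] = HOMOPHONES.get(t.lower(), t)
        elif kind == "typo" and len(t) > 1:
            j = rng.randrange(len(t) - 1)     # swap two adjacent characters
            tokens[i] = t[:j] + t[j + 1] + t[j] + t[j + 2:]
    return " ".join(tokens)
```

For example, `perturb("the pain is weak", "homophone", 1.0)` returns "the pane is week", while redaction at the same level masks every token; the finding that redactions hurt most is intuitive under this scheme, since they destroy token content rather than merely distorting its surface form.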

Citations: 0
AI-Generated Images of Substance Use and Recovery: Mixed Methods Case Study.
IF 2 Pub Date : 2026-02-19 DOI: 10.2196/81977
Kathryn Heley, Jeffrey K Hom, Linnea Laestadius

Background: Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.

Objective: To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder (SUD) and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.

Methods: We performed a mixed-methods case study examining depictions of substance use and recovery in images generated by ChatGPT-4o. We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of SUD. We then used a mixed-methods approach to analyze images for demographics and stigmatizing elements.

Results: Images produced in the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost exclusively Black women (74%, n=31).

Conclusions: Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully concords with best practices.

{"title":"AI-Generated Images of Substance Use and Recovery: Mixed Methods Case Study.","authors":"Kathryn Heley, Jeffrey K Hom, Linnea Laestadius","doi":"10.2196/81977","DOIUrl":"10.2196/81977","url":null,"abstract":"<p><strong>Background: </strong>Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.</p><p><strong>Objective: </strong>To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.</p><p><strong>Methods: </strong>We performed a mixed-methods case study examining depictions of substance use and recovery in images generated by ChatGPT 4.o. We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of SUD. We then used a mixed-methods approach to analyze images for demographics and stigmatizing elements.</p><p><strong>Results: </strong>Images produced in the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. 
Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost only Black women (74%, n=31).</p><p><strong>Conclusions: </strong>Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully concords with best practices.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81977"},"PeriodicalIF":2.0,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146229993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
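The four prompting conditions this case study compares can be laid out as data, together with a simple lexicon check of the kind person-first style guides imply. The prompt wordings and the stigma term list are hypothetical stand-ins; only the four-condition design comes from the abstract.

```python
# Four prompting conditions (design from the abstract); wordings are assumed.
STIGMATIZING_TERMS = {"addict", "junkie", "abuser"}  # assumed lexicon

CONDITIONS = {
    "colloquial": "an addict struggling with drugs",
    "person_first": "a person with a substance use disorder in recovery",
    "model_written": "a hopeful scene of someone rebuilding their daily life",
    "guideline_informed": "a person in recovery shown in an everyday setting",
}

def uses_stigmatizing_language(prompt: str) -> bool:
    """Flag a prompt containing any term from the (assumed) stigma lexicon."""
    return any(term in prompt.lower().split() for term in STIGMATIZING_TERMS)

flags = {name: uses_stigmatizing_language(p) for name, p in CONDITIONS.items()}
```

A lexicon check like this only screens the prompt text; the study's finding is that stigmatizing imagery can persist even when the prompt itself passes such a check.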
Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation.
IF 2 Pub Date: 2026-02-17 DOI: 10.2196/75064
Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle
Background: The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.

Objective: This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.

Methods: Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, and heart and lung complications or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.

Results: Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated the clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.

Conclusions: Overall, AI models hold potential to predict and prevent surgical complications, as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. Future research should prioritize prospectively validated models that use additional physiological features and address clinicians' concerns about generalization and adoption.
{"title":"Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation.","authors":"Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle","doi":"10.2196/75064","DOIUrl":"10.2196/75064","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, and heart and lung complications or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. 
Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated the clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;Overall, AI models hold potential to predict and prevent surgical complications as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. 
Future research should prioritize prospectively validated models ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e75064"},"PeriodicalIF":2.0,"publicationDate":"2026-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12912657/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146215163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
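The performance statistics this review extracts (sensitivity and area under the receiver operating characteristic curve) have compact definitions. A minimal pure-Python sketch using the rank-comparison identity for AUROC; the labels and scores at the bottom are hypothetical, not drawn from any reviewed study.

```python
def auroc(labels, scores):
    """AUROC via the rank-comparison (Mann-Whitney) identity: the fraction of
    positive/negative pairs the score orders correctly, ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sensitivity(labels, preds):
    """True-positive rate at a fixed decision threshold."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)

labels = [0, 0, 1, 1]           # hypothetical outcomes (complication: yes/no)
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model risk scores
```

The pair-counting form makes clear why AUROC is threshold-free, whereas sensitivity depends on the cutoff chosen for deployment, which is one reason the two are reported together.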
Large Language Models for Health Care Text Classification: Systematic Review.
IF 2 Pub Date: 2026-02-11 DOI: 10.2196/79202
Hajar Sakai, Sarah S Lam

Background: Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial (whether for clinical note analysis, diagnosis coding, or other related tasks), and LLMs present promising potential. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.

Objective: This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.

Methods: Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.

Results: The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.

Conclusions: This review identifies existing gaps in the literature and highlights future research directions for further investigation.

{"title":"Large Language Models for Health Care Text Classification: Systematic Review.","authors":"Hajar Sakai, Sarah S Lam","doi":"10.2196/79202","DOIUrl":"10.2196/79202","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial-whether for clinical note analysis, diagnosis coding, or other related tasks-and LLMs present promising potential. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.</p><p><strong>Objective: </strong>This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.</p><p><strong>Methods: </strong>Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. 
These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.</p><p><strong>Results: </strong>The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. 
All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.</p><p><strong>Conclusions: </strong>This review identifies existing gaps in the literature and highlights future research directions for further investigation.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e79202"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936667/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
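The accuracy-related metrics these studies report can be made concrete for the multilabel case highlighted above (eg, assigning several diagnosis codes to one note). A sketch of micro-averaged F1 over label sets; the example documents and code labels are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over label sets: pool true positives, false positives,
    and false negatives across all documents before computing precision/recall."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold-standard and predicted label sets for two clinical notes.
gold = [{"diabetes", "hypertension"}, {"asthma"}]
pred = [{"diabetes"}, {"asthma", "copd"}]
```

Micro-averaging weights every label decision equally, so frequent labels dominate; macro-averaging per label is the usual complement when rare clinical codes matter.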
Evaluation of Large Language Models for Peer Review in Transplantation Research: Algorithm Validation Study.
IF 2 Pub Date: 2026-02-11 DOI: 10.2196/84322
Selena Ming Shen, Zifu Wang, Krittika Paul, Meng-Hao Li, Xiao Huang, Naoru Koizumi
Background: Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.

Objective: This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations (prestigious, less prestigious, and none) on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.

Methods: A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.

Results: RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding the extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated statistically significant affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) trended toward bias. Each model displayed unique "personalities" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.

Conclusions: Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and appropriate editorial decisions.
{"title":"Evaluation of Large Language Models for Peer Review in Transplantation Research: Algorithm Validation Study.","authors":"Selena Ming Shen, Zifu Wang, Krittika Paul, Meng-Hao Li, Xiao Huang, Naoru Koizumi","doi":"10.2196/84322","DOIUrl":"10.2196/84322","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations-prestigious, less prestigious, and none-on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. 
Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) were substantially biased. Each model displayed unique \"personalities\" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. 
If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and a","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84322"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936655/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
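The chi-square test with adjusted Pearson residuals used above to probe affiliation bias can be sketched directly for a contingency table of affiliation condition by assigned quartile. The cell counts in the usage line are invented, not the study's data; the function computes the test statistic and residuals only (a P value would additionally need the chi-square distribution).

```python
import math

def chi_square_and_adjusted_residuals(table):
    """Pearson chi-square statistic plus adjusted standardized residuals
    (observed minus expected, scaled by the residual's standard error)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        adj = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi2 += (obs - exp) ** 2 / exp
            se = math.sqrt(exp * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            adj.append((obs - exp) / se)
        residuals.append(adj)
    return chi2, residuals

# Invented counts: rows = affiliation condition, columns = assigned quartile.
chi2, residuals = chi_square_and_adjusted_residuals([[20, 10], [10, 20]])
```

Adjusted residuals are roughly standard normal under independence, so cells with values beyond about ±2 point to where any affiliation effect concentrates.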
Evaluating Large Language Model-Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study.
IF 2 Pub Date: 2026-02-10 DOI: 10.2196/85221
Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin

Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. Parents and clinicians diverged notably in their helpfulness ratings, while clinicians' assessments of clinical accuracy and parents' assessments of readability each surfaced distinct insights. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.

{"title":"Evaluating Large Language Model-Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study.","authors":"Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin","doi":"10.2196/85221","DOIUrl":"10.2196/85221","url":null,"abstract":"<p><p>Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. There were important discrepancies between parents and clinicians in the realm of helpfulness, along with important insights by clinicians assessing clinical accuracy and parents assessing readability. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e85221"},"PeriodicalIF":2.0,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12933168/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146159603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}