A Pragmatic Framework for Federated Learning Risk and Governance in Academic Medical Centers
Daniel Bottomly, Bridget Barnes, Kuli Mavuwa, Nikki Lee, Holger R Roth, Chester Chen, Shannon K McWeeney
With the rapid development of artificial intelligence (AI), particularly large language models, there is growing interest in adopting AI approaches within academic medical centers (AMCs). However, the vast amounts of data required for AI and the sensitive nature of medical information pose significant challenges to developing high-performing models at individual institutions. Furthermore, recent changes in government funding priorities may result in the decentralization of biomedical data repositories, risking significant barriers to effective data sharing and robust model development. This has generated significant interest in federated learning (FL), which enables collaborative model training without transferring data between institutions, thereby enhancing the protection of proprietary and sensitive information. While FL offers a crucial pathway to multi-institutional AI development that maintains data privacy, it also exposes AMCs to novel governance, security, and operational risks that are not fully addressed by existing procedures. In response, this manuscript provides a perspective grounded both in leading international standards (the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework [NIST AI RMF] and International Organization for Standardization/International Electrotechnical Commission [ISO/IEC] 42001) and in the real-world governance experience of AMC leadership. We present a risk differentiation framework, an FL risk matrix, and a set of essential governance artifacts, each mapped to key institutional challenges and reviewed for alignment with core standards, but offered as pragmatic, illustrative guides rather than prescriptive checklists. Together, these tools offer AMC security, privacy, and governance leaders a standards-informed, context-sensitive resource for addressing the evolving risks of FL in biomedical research and clinical environments.
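To make FL's data-stays-local mechanism concrete, the sketch below shows a minimal federated averaging (FedAvg) loop. The three-site setup, linear model, and round count are hypothetical illustrations rather than anything prescribed by the article; production deployments add the secure aggregation, authentication, and auditing controls that the governance artifacts discussed here are meant to cover.

```python
# Minimal FedAvg sketch: each site trains locally and shares only model
# weights with a coordinator; raw records never leave the institution.
# The sites, data, and linear model below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's local training step: gradient descent on squared error."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three hypothetical AMCs, each holding its own private (X, y).
sites = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(3)]

global_w = np.zeros(3)
for _ in range(10):  # federation rounds
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    # The coordinator aggregates weights (weighted by site sample count);
    # it only ever sees parameters, never patient-level data.
    global_w = np.average(local_ws, axis=0, weights=sizes)
```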
{"title":"A Pragmatic Framework for Federated Learning Risk and Governance in Academic Medical Centers.","authors":"Daniel Bottomly, Bridget Barnes, Kuli Mavuwa, Nikki Lee, Holger R Roth, Chester Chen, Shannon K McWeeney","doi":"10.2196/80022","DOIUrl":"10.2196/80022","url":null,"abstract":"<p><strong>Unlabelled: </strong>With the rapid development of artificial intelligence (AI), particularly large language models, there is growing interest in adopting AI approaches within academic medical centers (AMCs). However, the vast amounts of data required for AI and the sensitive nature of medical information pose significant challenges to developing high-performing models at individual institutions. Furthermore, recent changes in government funding priorities may result in the decentralization of biomedical data repositories that risk creating significant barriers to effective data sharing and robust model development. This has generated significant interest in federated learning (FL), which enables collaborative model training without transferring data between institutions, thereby enhancing the protection of proprietary and sensitive information. While FL offers a crucial pathway to enable multi-institutional AI development while maintaining data privacy, it also exposes AMCs to novel governance, security, and operational risks that are not fully addressed by existing procedures. In response, this manuscript provides a perspective grounded in both leading international standards (NIST AI RMF [National Institute of Standards and Technology Artificial Intelligence Risk Management Framework], International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) 42001) and in the real-world governance experience of AMC leadership. We present a risk differentiation framework, an FL risk matrix, and a set of essential governance artifacts-each mapped to key institutional challenges and reviewed for alignment with core standards but offered as pragmatic, illustrative guides rather than prescriptive checklists. Together, these tools represent a novel resource to support AMC security, privacy, and governance leaders with standards-informed, context-sensitive tools for addressing the evolving risks of FL in biomedical research and clinical environments.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80022"},"PeriodicalIF":2.0,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12977002/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147437902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: In recent years, artificial intelligence (AI) systems have increasingly been used to assess emotional states in health care. AI offers a safe, quick, user-friendly, and objective emotional evaluation method. However, evidence supporting its implementation in health care remains limited.
Objective: This study aimed to explore the concurrent validity and test-retest reliability of emotion recognition AI based on facial expressions.
Methods: In this study, we used the Kokoro Sensor, an accurate and widely recognized automated facial expression recognition system. The Japanese version of the Profile of Mood States-Short Form was used to screen for the potential influence of mental states on facial expressions. The study participants made positive, negative, and neutral expressions, which were analyzed by the emotion recognition AI. Agreement between the AI results and subjective evaluations was assessed by participants and a researcher using a 4-point Likert-type scale. The facial expressions and emotion analysis process were repeated after a 30-minute interval to investigate reliability. Concurrent validity was evaluated using the content validity index (CVI) and κ coefficient, and test-retest reliability was determined using the κ coefficient.
Results: The study participants were 40 individuals whose mental states did not deviate from the reference range of the Profile of Mood States manual. Among the participants, the CVI values for positive, neutral, and negative expressions were 95%, 98%, and 85%, respectively. Among the researchers, the corresponding CVI values were 100%, 100%, and 70%, respectively. The overall weighted κ coefficient was 0.55 (CI 0.44-0.67), indicating moderate agreement. The agreement was almost perfect for distinguishing positive from neutral expressions (κ=0.83, 95% CI 0.70-0.95) but not statistically significant for distinguishing negative from neutral expressions (κ=0.15, 95% CI -0.07 to 0.37). Test-retest reliability analysis showed an overall weighted κ coefficient of 0.66, reflecting substantial reliability. Almost perfect agreement was observed for distinguishing positive from neutral expressions (κ=0.85, 95% CI 0.73-0.97), while distinguishing negative from neutral expressions showed limited reliability (κ=0.36, 95% CI 0.16-0.57).
Conclusions: Our findings suggest that the Kokoro Sensor may be useful for identifying positive affect, given its acceptable concurrent validity for overall valence estimation and its high agreement for distinguishing positive from neutral expressions. However, concurrent validity for negative expressions did not meet the prespecified benchmark based on the researcher's ratings, and agreement for distinguishing negative from neutral expressions was limited, which may constrain clinical utility for detecting negative affect. Therefore, in clinical settings, the Kokoro Sensor should be used as an auxiliary tool rather than as a standalone method.
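As a rough illustration of the agreement statistics above, the following sketch computes a weighted Cohen κ on ordinally coded valence labels, plus pairwise κ values for the positive-neutral and negative-neutral contrasts. The example labels and the linear weighting scheme are assumptions for illustration; the study's data and exact κ configuration are not reproduced here.

```python
# Agreement-analysis sketch: weighted κ on ordinal valence codes, plus
# pairwise κ for two-class contrasts. The label arrays are invented.
from sklearn.metrics import cohen_kappa_score

# Ordinal coding: -1 = negative, 0 = neutral, 1 = positive.
ai_labels    = [1, 0, -1, 1, 0, 0, 1, -1]
rater_labels = [1, 0,  0, 1, 0, 1, 1, -1]

# Linear weighting penalizes positive/negative confusions more than
# positive/neutral ones (the weighting scheme is assumed here).
overall_kappa = cohen_kappa_score(ai_labels, rater_labels, weights="linear")

def pairwise_kappa(a, b, keep):
    """κ restricted to cases where both raters used one of two classes."""
    pairs = [(x, y) for x, y in zip(a, b) if x in keep and y in keep]
    xs, ys = zip(*pairs)
    return cohen_kappa_score(xs, ys)

pos_vs_neutral = pairwise_kappa(ai_labels, rater_labels, {1, 0})
neg_vs_neutral = pairwise_kappa(ai_labels, rater_labels, {-1, 0})
print(overall_kappa, pos_vs_neutral, neg_vs_neutral)
```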
{"title":"Facial Expression-Based Evaluation of the Emotion Estimation Software Kokoro Sensor in Healthy Individuals: Validation and Reliability Pilot Study.","authors":"Shota Yoshihara, Satoru Amano, Kayoko Takahashi","doi":"10.2196/81868","DOIUrl":"10.2196/81868","url":null,"abstract":"<p><strong>Background: </strong>In recent years, artificial intelligence (AI) systems have increasingly been used to assess emotional states in health care. AI offers a safe, quick, user-friendly, and objective emotional evaluation method. However, evidence supporting its implementation in health care remains limited.</p><p><strong>Objective: </strong>This study aimed to explore the concurrent validity and test-retest reliability of emotion recognition AI based on facial expressions.</p><p><strong>Methods: </strong>In this study, we used the Kokoro Sensor, an accurate and widely recognized automated facial expression recognition system. The Japanese version of the Profile of Mood States-Short Form was used to screen the potential influence of mental states on facial expressions. The study participants made positive, negative, and neutral expressions, which were analyzed by the emotion recognition AI. Agreement between the results of the AI and subjective evaluations was assessed by participants and a researcher using a 4-point Likert-type scale. The facial expressions and emotion analysis process were repeated after a 30-minute interval to investigate reliability. Concurrent validity was evaluated using the content validity index (CVI) and κ coefficient, and test-retest reliability was determined using the κ coefficient.</p><p><strong>Results: </strong>The study participants were 40 individuals whose mental states did not deviate from the reference range of the Profile of Mood States manual. Among the participants, the CVI values for positive, neutral, and negative expressions were 95%, 98%, and 85%, respectively. Among the researchers, the corresponding CVI values were 100%, 100%, and 70%, respectively. The overall weighted κ coefficient was 0.55 (CI 0.44-0.67), indicating moderate agreement. The agreement was almost perfect for distinguishing positive from neutral expressions (κ=0.83, 95% CI 0.70-0.95) but not statistically significant for distinguishing negative from neutral expressions (κ=0.15, 95% CI -0.07 to 0.37). Test-retest reliability analysis showed an overall weighted κ coefficient of 0.66, reflecting substantial reliability. Almost perfect agreement was observed for distinguishing positive from neutral expressions (κ=0.85, 95% CI 0.73-0.97), while distinguishing negative from neutral expressions showed limited reliability (κ=0.36, 95% CI 0.16-0.57).</p><p><strong>Conclusions: </strong>Our findings suggest that the Kokoro Sensor may be useful for identifying positive affect, given its acceptable concurrent validity for overall valence estimation and its high agreement for distinguishing positive from neutral expressions. However, concurrent validity for negative expressions did not meet the prespecified benchmark based on the researcher's ratings, and agreement for distinguishing negative from neutral expressions was limited, which may constrain clinical utility for detecting negative affect. 
Therefore, in clinical settings, the Kokor","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81868"},"PeriodicalIF":2.0,"publicationDate":"2026-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12945095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147313280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompal Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib
Background: The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHRs), but this process is labor-intensive and prone to interrater variability. Large language models (LLMs) have demonstrated potential in automating text classification.
Objective: We aimed to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.
Methods: We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) at hospital discharge and (2) approximately 90 days post discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS scores 0-2) versus non-independence (mRS scores 3-6). Four-fold cross-validation was conducted using accuracy and the Cohen κ as model performance metrics.
Results: A total of 2290 EHR passages with corresponding mRS scores were included in model training. The multiclass model, considering all seven mRS scores, attained an accuracy of 77% and a weighted Cohen κ of 0.92. Class-specific accuracy was the highest for mRS score 4 (90%) and the lowest for mRS score 2 (28%). The binary model, considering only functional independence versus non-independence, attained an accuracy of 92% and a Cohen κ of 0.84.
Conclusions: Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improving discrimination between intermediate scores is required.
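To illustrate the two evaluation setups, this sketch scores hypothetical predictions with exact-match accuracy and weighted Cohen κ for the 7-class task, then collapses scores into the independence split (mRS 0-2 vs 3-6) for the binary task. The predictions and the quadratic weighting are assumptions; the abstract reports a weighted κ without specifying the weighting scheme.

```python
# Evaluation sketch for the two mRS classification setups; y_pred stands in
# for a fine-tuned LLM's outputs on held-out EHR passages.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 5, 6, 4, 2, 3]
y_pred = [0, 1, 1, 3, 4, 5, 6, 4, 3, 3]  # hypothetical predictions

# Multiclass: a weighted κ credits near-misses (predicting 3 for a true 4),
# which is how κ can be high (0.92) while exact accuracy is modest (77%).
multiclass_acc = accuracy_score(y_true, y_pred)
weighted_kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Binary: functional independence (mRS 0-2) vs non-independence (3-6).
to_binary = lambda s: int(s >= 3)
binary_acc = accuracy_score([to_binary(s) for s in y_true],
                            [to_binary(s) for s in y_pred])
binary_kappa = cohen_kappa_score([to_binary(s) for s in y_true],
                                 [to_binary(s) for s in y_pred])
print(multiclass_acc, weighted_kappa, binary_acc, binary_kappa)
```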
{"title":"Assessment of the Modified Rankin Scale in Electronic Health Records With a Fine-Tuned Large Language Model: Development and Internal Validation.","authors":"Luis Silva, Marcus Milani, Sohum Bindra, Salman Ikramuddin, Megan Tessmer, Kaylee Frederickson, Abhigyan Datta, Halil Ergen, Alex Stangebye, Dawson Cooper, Kompal Kumar, Jeremy Yeung, Kamakshi Lakshminarayan, Christopher Streib","doi":"10.2196/82607","DOIUrl":"10.2196/82607","url":null,"abstract":"<p><strong>Background: </strong>The modified Rankin scale (mRS) is an important metric in stroke research, often used as a primary outcome in clinical trials and observational studies. The mRS can be assessed retrospectively from electronic health records (EHRs), but this process is labor-intensive and prone to interrater variability. Large language models (LLMs) have demonstrated potential in automating text classification.</p><p><strong>Objective: </strong>We aimed to create a fine-tuned LLM that can analyze EHR text and classify mRS scores for clinical and research applications.</p><p><strong>Methods: </strong>We performed a retrospective cohort study of patients admitted to a specialist stroke neurology service at a large academic hospital system between August 2020 and June 2023. Each patient's medical record was reviewed at two time points: (1) at hospital discharge and (2) approximately 90 days post discharge. Two independent researchers assigned an mRS score at each time point. Two separate models were trained on EHR passages with corresponding mRS scores as labeled outcomes: (1) a multiclass model to classify all seven mRS scores and (2) a binary model to classify functional independence (mRS scores 0-2) versus non-independence (mRS scores 3-6). Four-fold cross-validation was conducted using accuracy and the Cohen κ as model performance metrics.</p><p><strong>Results: </strong>A total of 2290 EHR passages with corresponding mRS scores were included in model training. The multiclass model-considering all seven scores of the mRS-attained an accuracy of 77% and a weighted Cohen κ of 0.92. Class-specific accuracy was the highest for mRS score 4 (90%) and the lowest for mRS score 2 (28%). The binary model-considering only functional independence versus non-independence-attained an accuracy of 92% and a Cohen κ of 0.84.</p><p><strong>Conclusions: </strong>Our findings demonstrate that LLMs can be successfully trained to determine mRS scores through EHR text analysis; however, improving discrimination between intermediate scores is required.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e82607"},"PeriodicalIF":2.0,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12935414/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147292138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petra Apell, Sara Locher, Annie Milde, Henrik Eriksson
Background: Artificial intelligence (AI) is a topic of considerable hype, with many actors sensing its high potential for health care applications. Despite this, adoption has been slow, and few applications have been implemented in clinical practice.
Objective: The aim of our study was to investigate the challenges associated with using AI in health care, as well as provide suggestions for how further adoption of AI within health care organizations can be facilitated.
Methods: A qualitative case study with a mixed methods approach was conducted at one of Sweden's largest hospitals. Regulatory-approved AI medical devices were analyzed, and primary qualitative data from 14 expert interviews were collected and cross-referenced with secondary quantitative data. The framework of technological innovation systems was used to analyze the system factors and their dynamics to identify blocking mechanisms and areas for improvement.
Results: Addressing the challenges related to knowledge development, diffusion, legitimation, and resource mobilization could trigger a cascade of positive activities, thereby significantly enhancing the overall performance of the innovation system. Creating dedicated testing environments to evaluate safety and efficacy would facilitate routine clinical use and reinforce the use of AI innovations in health care organizations.
Conclusions: This analysis shows that the adoption of AI health care technology innovations can be accelerated through targeted strategies and supportive mechanisms triggering virtuous cycles that facilitate clinical validation and generate compelling use cases. The interconnection between guidance of search and entrepreneurial experimentation has been confirmed, providing the initial conditions for knowledge development, diffusion, and legitimation in the early stages of emerging technologies.
{"title":"Explaining the Slow Adoption of AI Innovations in Health Care: Network Analysis Approach.","authors":"Petra Apell, Sara Locher, Annie Milde, Henrik Eriksson","doi":"10.2196/60458","DOIUrl":"10.2196/60458","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is a topic of considerable hype, with many actors sensing its high potential for health care applications. Despite this, the adoption has been slow, with few applications being implemented in clinical practice.</p><p><strong>Objective: </strong>The aim of our study was to investigate the challenges associated with using AI in health care, as well as provide suggestions for how further adoption of AI within health care organizations can be facilitated.</p><p><strong>Methods: </strong>A qualitative case study with a mixed methods approach was conducted at one of Sweden's largest hospitals. Regulatory approved AI medical devices were analyzed, and primary qualitative data from 14 expert interviews were collected and cross-referenced with secondary quantitative data. The framework of technological innovation systems was used to analyze the system factors and their dynamics to identify blocking mechanisms and areas for improvement.</p><p><strong>Results: </strong>The challenges related to knowledge development, diffusion, legitimation, and resource mobilization could trigger a cascade of positive activities, thereby significantly enhancing the overall performance of the innovation system. Creating dedicated testing environments to evaluate safety and efficacy would facilitate the routine clinical use and reinforce the use of AI innovations in health care organizations.</p><p><strong>Conclusions: </strong>This analysis shows that the adoption of AI health care technology innovations can be accelerated through targeted strategies and supportive mechanisms triggering virtuous cycles that facilitate clinical validation and generate compelling use cases. The interconnection between guidance of search and entrepreneurial experimentation has been confirmed, providing the initial conditions for knowledge development, diffusion, and legitimation in the early stages of emerging technologies.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e60458"},"PeriodicalIF":2.0,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12972688/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.
Objective: This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.
Methods: We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.
Results: Contrary to expectations, LLMs demonstrated notable robustness to common variations; in more than half of the cases (151/270, 55.92%), performance was stable or improved. In some cases (38/270, 14.07%), variations resulted in increased performance, especially at lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, were more detrimental than other variations.
Conclusions: Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.
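As a sketch of how such perturbed inputs can be generated, the functions below apply the study's three human-like variation types (redaction, homophone substitution, and typographical errors) at a configurable perturbation rate. The homophone table, token-level redaction marker, and 20% rate are illustrative assumptions, not the published dataset's procedure.

```python
# Sketch of the three perturbation types at a configurable level. The
# homophone table and rates are illustrative, not the paper's dataset.
import random

HOMOPHONES = {"their": "there", "too": "to", "weak": "week", "heel": "heal"}

def redact(tokens, rate, rng):
    """Drop information outright, as with privacy redactions or lapses."""
    return [("[REDACTED]" if rng.random() < rate else t) for t in tokens]

def homophone_swap(tokens, rate, rng):
    """Replace a word with a same-sounding one when the table has a match."""
    return [(HOMOPHONES.get(t, t) if rng.random() < rate else t) for t in tokens]

def typo(tokens, rate, rng):
    """Swap adjacent characters to mimic typographical errors."""
    out = []
    for t in tokens:
        if len(t) > 3 and rng.random() < rate:
            i = rng.randrange(len(t) - 1)
            t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
        out.append(t)
    return out

rng = random.Random(42)
text = "the patient felt too weak to walk on their heel".split()
for perturb in (redact, homophone_swap, typo):
    print(perturb.__name__, " ".join(perturb(text, rate=0.2, rng=rng)))
```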
{"title":"Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation.","authors":"Saubhagya Joshi, Monjil Mehta, Sarjak Maniar, Mengqian Wang, Vivek Kumar Singh","doi":"10.2196/83640","DOIUrl":"10.2196/83640","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) are increasingly integrated into health care, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.</p><p><strong>Objective: </strong>This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.</p><p><strong>Methods: </strong>We systematically evaluate 3 LLMs on 3 health-related tasks using a novel dataset containing 3 types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.</p><p><strong>Results: </strong>Contrary to expectations, LLMs demonstrate notable robustness to common variations, and in more than half of the cases (151/270, 55.92%), the performance was stable or improved. In some cases (38/270, 14.07%), variations resulted in an increased performance, especially when dealing with lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, are more detrimental than other variations.</p><p><strong>Conclusions: </strong>Our findings highlight the need for health care applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this study provides actionable insights for improving model resilience and guiding the development of safer, more effective artificial intelligence tools in health care. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e83640"},"PeriodicalIF":2.0,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12923095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146260202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.
Objective: To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.
Methods: We performed a mixed methods case study examining depictions of substance use and recovery in images generated by ChatGPT (GPT-4o). We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of substance use disorder (SUD). We then used a mixed methods approach to analyze images for demographics and stigmatizing elements.
Results: Images produced by the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost exclusively Black women (74%, n=31).
Conclusions: Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully conforms to best practices.
{"title":"AI-Generated Images of Substance Use and Recovery: Mixed Methods Case Study.","authors":"Kathryn Heley, Jeffrey K Hom, Linnea Laestadius","doi":"10.2196/81977","DOIUrl":"10.2196/81977","url":null,"abstract":"<p><strong>Background: </strong>Images created with generative artificial intelligence (AI) tools are increasingly used for health communication due to their ease of use, speed, accessibility, and low cost. However, AI-generated images may bring practical and ethical risks to health practitioners and the public, including through the perpetuation of stigma against vulnerable and historically marginalized groups.</p><p><strong>Objective: </strong>To understand the potential value of AI-generated images for health care and public health communication, we sought to analyze images of substance use disorder and recovery generated with ChatGPT. Specifically, we sought to investigate: (1) the default visual outputs produced in response to a range of prompts about substance use disorder and recovery, and (2) the extent to which prompt modification and guideline-informed prompting could mitigate potentially stigmatizing imagery.</p><p><strong>Methods: </strong>We performed a mixed-methods case study examining depictions of substance use and recovery in images generated by ChatGPT 4.o. We generated images (n=84) using (1) prompts with colloquial and stigmatizing language, (2) prompts that follow best practices for person-first language, (3) image prompts written by ChatGPT, and (4) a custom GPT informed by guidelines for images of SUD. We then used a mixed-methods approach to analyze images for demographics and stigmatizing elements.</p><p><strong>Results: </strong>Images produced in the default ChatGPT model featured primarily White men (81%, n=34). Further, images tended to be stigmatizing, featuring injection drug use, dark colors, and symbolic elements such as chains. These trends persisted even when person-first language prompts were used. Images produced by the guideline-informed custom GPT were markedly less stigmatizing; however, they featured almost only Black women (74%, n=31).</p><p><strong>Conclusions: </strong>Our findings confirm prior research about stigma and biases in AI-generated images and extend this literature to substance use. However, our findings also suggest that (1) images can be improved when clear guidelines are provided and (2) even with guidelines, iteration is needed to create an image that fully concords with best practices.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81977"},"PeriodicalIF":2.0,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146229993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle
Background: The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.
Objective: This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.
Methods: Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, heart and lung complications, or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.
Results: Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy, with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.
Conclusions: Overall, AI models hold potential to predict and prevent surgical complications, as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. Future research should prioritize prospectively validated models that use additional physiological features and address clinicians' concerns regarding generalization and adoption.
{"title":"Application of AI Models for Preventing Surgical Complications: Scoping Review of Clinical Readiness and Barriers to Implementation.","authors":"Kjersti Mevik, Ashenafi Zebene Woldaregay, Eva Lindell Jonsson, Miguel Tejedor, Claire Temple-Oberle","doi":"10.2196/75064","DOIUrl":"10.2196/75064","url":null,"abstract":"<p><strong>Background: </strong>The impact of surgical complications is substantial and multifaceted, affecting patients and their families, surgeons, and health care systems. Despite the remarkable progress in artificial intelligence (AI), there remains a notable gap in the prospective implementation of AI models in surgery that use real-time data to support decision-making and enable proactive intervention to reduce the risk of surgical complications.</p><p><strong>Objective: </strong>This scoping review aims to assess and analyze the adoption and use of AI models for preventing surgical complications. Furthermore, this review aims to identify barriers and facilitators for implementation at the bedside.</p><p><strong>Methods: </strong>Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, we conducted a literature search using IEEE Xplore, Scopus, Web of Science, MEDLINE, ProQuest, PubMed, ABI, Embase, Epistemonikos, CINAHL, and Cochrane registries. The inclusion criteria included empirical, peer-reviewed studies published in English between January 2013 and January 2025, involving AI models for preventing surgical complications (surgical site infections, and heart and lung complications or stroke) in real-world settings. Exclusions included retrospective algorithm-only validations, nonempirical research (eg, editorials or protocols), and non-English studies. Study characteristics and AI model development details were extracted, along with performance statistics (eg, sensitivity and area under the receiver operating characteristic curve). We then used thematic analysis to synthesize findings related to AI models, prediction outputs, and validation methods. Studies were grouped into three main themes: (1) duration of hypotension, (2) risk for complications, and (3) decision support tool.</p><p><strong>Results: </strong>Of the 275 identified records, 19 were included. The included models frequently demonstrated strong technical accuracy with high sensitivity and area under the receiver operating characteristic curve, particularly among studies evaluating decision support tools. However, only a few models were adopted routinely in clinical practice. Two studies evaluated the clinicians' perceptions regarding the use of AI models, reporting predominantly positive assessments of their usefulness.</p><p><strong>Conclusions: </strong>Overall, AI models hold potential to predict and prevent surgical complications as the validation studies demonstrated high accuracy. However, implementation in routine practice remains limited by usability barriers, workflow misalignment, trust concerns, and financial and ethical constraints. The evidence included in this scoping review was limited by the heterogeneity in study design and the predominance of small-scale feasibility studies, particularly for hypotension prediction. 
Future research should prioritize prospectively validated models ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e75064"},"PeriodicalIF":2.0,"publicationDate":"2026-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12912657/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146215163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial, whether for clinical note analysis, diagnosis coding, or other related tasks, and LLMs show considerable promise. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.
Objective: This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.
Methods: Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.
Results: The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.
Conclusions: This review identifies existing gaps in the literature and highlights future research directions for further investigation.
{"title":"Large Language Models for Health Care Text Classification: Systematic Review.","authors":"Hajar Sakai, Sarah S Lam","doi":"10.2196/79202","DOIUrl":"10.2196/79202","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have fundamentally transformed approaches to natural language processing tasks across diverse domains. In health care, accurate and cost-efficient text classification is crucial-whether for clinical note analysis, diagnosis coding, or other related tasks-and LLMs present promising potential. Text classification has long faced multiple challenges, including the need for manual annotation during training, the handling of imbalanced data, and the development of scalable approaches. In health care, additional challenges arise, particularly the critical need to preserve patient data privacy and the complexity of medical terminology. Numerous studies have leveraged LLMs for automated health care text classification and compared their performance with traditional machine learning-based methods, which typically require embedding, annotation, and training. However, existing systematic reviews of LLMs either do not specialize in text classification or do not focus specifically on the health care domain.</p><p><strong>Objective: </strong>This research synthesizes and critically evaluates the current evidence in the literature on the use of LLMs for text classification in health care settings.</p><p><strong>Methods: </strong>Major databases (eg, Google Scholar, Scopus, PubMed, ScienceDirect) and other resources were queried for papers published between 2018 and 2024, following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, resulting in 65 eligible research articles. These studies were categorized by text classification type (eg, binary classification, multilabel classification), application (eg, clinical decision support, public health and opinion analysis), methodology, type of health care text, and the metrics used for evaluation and validation.</p><p><strong>Results: </strong>The systematic review includes 65 research articles published between 2020 and Q3 2024, showing a significant increase in publications over time, with 28 papers published in Q1-Q3 2024 alone. Fine-tuning was the most common LLM-based approach (35 papers), followed by prompt engineering (17 papers). BERT (Bidirectional Encoder Representations from Transformers) variants were predominantly used for multilabel classification (50%), whereas closed-source LLMs were most commonly applied to binary (44.0%) and multiclass (30.6%) classification tasks. Clinical decision support was the most frequent application (29 papers). Over 80% of studies used English-language datasets, with clinical notes being the most common text type. 
All studies employed accuracy-related metrics for evaluation, and the findings consistently showed that LLMs outperformed traditional machine learning approaches in health care text classification tasks.</p><p><strong>Conclusions: </strong>This review identifies existing gaps in the literature and highlights future research directions for further investigation.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e79202"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936667/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.
Objective: This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations (prestigious, less prestigious, and none) on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.
Methods: A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek R1-Distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.
Results: RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding the extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated statistically significant affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) approached the significance threshold. Each model displayed unique "personalities" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.
Conclusions: Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and a
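As an illustration of the affiliation-bias analysis, the sketch below runs a chi-square test on an affiliation-by-quartile contingency table and computes adjusted Pearson residuals to localize any deviation. The cell counts are invented, and the residual formula is the standard one rather than anything specific to this paper.

```python
# Affiliation-bias sketch: chi-square test over affiliation x predicted
# quartile, with adjusted Pearson residuals per cell. Counts are invented.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: no affiliation, prestigious, less prestigious; columns: Q1..Q4.
table = np.array([[12, 68, 95, 25],
                  [15, 75, 90, 20],
                  [10, 70, 98, 22]])

chi2, p, dof, expected = chi2_contingency(table)

# Adjusted residual_ij = (O - E) / sqrt(E * (1 - row_share) * (1 - col_share));
# cells with |residual| > ~2 indicate where any bias is concentrated.
n = table.sum()
row_share = table.sum(axis=1, keepdims=True) / n
col_share = table.sum(axis=0, keepdims=True) / n
adj_resid = (table - expected) / np.sqrt(expected * (1 - row_share) * (1 - col_share))

print(f"chi2={chi2:.2f}, p={p:.3f}")
print(np.round(adj_resid, 2))
```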
{"title":"Evaluation of Large Language Models for Peer Review in Transplantation Research: Algorithm Validation Study.","authors":"Selena Ming Shen, Zifu Wang, Krittika Paul, Meng-Hao Li, Xiao Huang, Naoru Koizumi","doi":"10.2196/84322","DOIUrl":"10.2196/84322","url":null,"abstract":"<p><strong>Background: </strong>Peer review remains central to ensuring research quality, yet it is constrained by reviewer fatigue and human bias. The rapid rise in scientific publishing has worsened these challenges, prompting interest in whether large language models (LLMs) can support or improve the peer review process.</p><p><strong>Objective: </strong>This study aimed to address critical gaps in the use of LLMs for peer review of papers in the field of organ transplantation by (1) comparing the performance of 5 recent open-source LLMs; (2) evaluating the impact of author affiliations-prestigious, less prestigious, and none-on LLM review outcomes; and (3) examining the influence of prompt engineering strategies, including zero-shot prompting, few-shot prompting, tree of thoughts (ToT) prompting, and retrieval-augmented generation (RAG), on review decisions.</p><p><strong>Methods: </strong>A dataset of 200 transplantation papers published between 2024 and 2025 across 4 journal quartiles was evaluated using 5 state-of-the-art open-source LLMs (Llama 3.3, Mistral 7B, Gemma 2, DeepSeek r1-distill Qwen, and Qwen 2.5). The 4 prompting techniques (zero-shot prompting, few-shot prompting, ToT prompting, and RAG) were tested under multiple temperature settings. Models were instructed to categorize papers into quartiles. To assess fairness, each paper was evaluated 3 times: with no affiliation, a prestigious affiliation, and a less prestigious affiliation. Accuracy, decisions, runtime, and computing resource use were recorded. Chi-square tests and adjusted Pearson residuals were used to examine the presence of affiliation bias.</p><p><strong>Results: </strong>RAG with a temperature of 0.5 achieved the best overall performance (exact match accuracy: 0.35; loose match accuracy: 0.78). Across all models, LLMs frequently assigned manuscripts to quartile 2 and quartile 3 while avoiding extreme quartiles (quartile 1 and quartile 4). None of the models demonstrated affiliation bias, though Gemma 2 (P=.08) and Qwen 2.5 (P=.054) were substantially biased. Each model displayed unique \"personalities\" in quartile predictions, influencing consistency. Mistral had the highest exact match accuracy (0.35) despite having both the lowest average runtime (1246.378 seconds) and computing resource use (7 billion parameters). While accuracy was insufficient for independent review, LLMs showed value in supporting preliminary triage tasks.</p><p><strong>Conclusions: </strong>Current open-source LLMs are not reliable enough to replace human peer reviewers. The largely absent affiliation bias suggests potential advantages in fairness, but these benefits do not offset the low decision accuracy. Mistral demonstrated the greatest accuracy and computational efficiency, and RAG with a moderate temperature emerged as the most effective prompting strategy. 
If LLMs are used to assist in peer review, their outputs require nonnegotiable human supervision to ensure correct judgment and a","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84322"},"PeriodicalIF":2.0,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12936655/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146168177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin
Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. Parents and clinicians diverged notably in their helpfulness ratings, while clinicians offered key insights on clinical accuracy and parents on readability. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.
{"title":"Evaluating Large Language Model-Generated Clinical Summaries Through a Dual-Perspective Framework: Retrospective Observational Study.","authors":"Brian Han, Traci Barnes, Charitha D Reddy, Andrew Y Shin","doi":"10.2196/85221","DOIUrl":"10.2196/85221","url":null,"abstract":"<p><p>Large language models (LLMs) are increasingly used by patients and families to interpret complex medical documentation, yet most evaluations focus only on clinician-judged accuracy. In this study, 50 pediatric cardiac intensive care unit notes were summarized using GPT-4o mini and reviewed by both physicians and parents, who rated readability, clinical fidelity, and helpfulness. There were important discrepancies between parents and clinicians in the realm of helpfulness, along with important insights by clinicians assessing clinical accuracy and parents assessing readability. This study highlights the need for dual-perspective frameworks that balance clinical precision with patient understanding.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e85221"},"PeriodicalIF":2.0,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12933168/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146159603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}