NEJM AI: Latest Articles

Assessing Generative AI Chatbots for Alcohol Misuse Support: A Longitudinal Simulation Study.
Pub Date: 2026-02-01; Epub Date: 2026-01-22; DOI: 10.1056/aics2500676
Lori Uscher-Pines, Jessica L Sousa, Pushpa Raja, Lynsay Ayer, Ateev Mehrotra, Haiden A Huskamp, Alisa B Busch

Large language model (LLM)-based chatbots are increasingly used for behavioral health support, yet few studies have rigorously evaluated their advice on alcohol misuse. We evaluated how seven publicly available chatbots, including general-purpose and behavioral health-focused tools, respond to alcohol misuse-related questions. Using a fictional case, we simulated longitudinal chatbot interactions over seven days with 25 prompts derived from real-world Reddit posts. Using an evaluation framework specific to chatbots, four clinicians independently rated each chatbot's transcript along five domains: empathy, quality of information, usefulness, responsiveness, and scope awareness. Clinicians also assessed secondary dimensions, including stigmatizing language and challenging the user (vs. only validating feelings). We generated descriptive statistics on performance and identified examples of problematic output. Across all chatbots, empathy was the highest-rated domain (mean score 4.6/5), while quality of information was the lowest (mean 2.7/5). Overall mean performance scores varied considerably across the chatbots, ranging from 2.1 (SD 1.1) to 4.5 (SD 0.8). There were no significant differences in performance between behavioral health and general-purpose chatbots. All chatbots had one or more examples of guidance deemed inappropriate, overstated, or inaccurate. All avoided stigmatizing or judgmental language and supported self-efficacy. Chatbots were perceived to vary widely in their ability to support individuals with alcohol misuse. While the chatbots were generally strong in empathy, there is room for improvement in response quality. As chatbot use expands, users and clinicians should be aware of the strengths and weaknesses of chatbots in providing advice on alcohol misuse.

Journal: NEJM AI, 3(2). Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829918/pdf/
Citations: 0

Ambient AI Scribes in Clinical Practice: A Randomized Trial.
Pub Date: 2025-12-01; Epub Date: 2025-11-26; DOI: 10.1056/aioa2501000
Paul J Lukac, William Turner, Sitaram Vangala, Aaron T Chin, Joshua Khalili, Ya-Chen Tina Shih, Catherine Sarkisian, Eric M Cheng, John N Mafi

Background: Ambient artificial intelligence (AI) scribes record patient encounters and rapidly generate visit notes, representing a promising solution to documentation burden and physician burnout. However, the scribes' impacts have not been examined in randomized clinical trials.

Methods: In this parallel three-group pragmatic randomized clinical trial, 238 outpatient physicians, representing 14 specialties, were assigned 1:1:1 via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week) to one of two AI scribe applications, Microsoft Dragon Ambient eXperience (DAX) Copilot or Nabla, or to a usual-care control group from November 4, 2024, to January 3, 2025. The primary outcome was the change from baseline in log writing time-in-note. Secondary end points measured by surveys included the Mini-Z 2.0, a four-item physician task load (PTL), and Professional Fulfillment Index - Work Exhaustion (PFI-WE) scores to evaluate aspects of burnout; work environment; stress; and targeted questions addressing safety, accuracy, and usability.

Results: DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% decrease in time-in-note versus the control group (95% confidence interval [CI], -17.2% to -1.8%; P=0.02), whereas DAX users exhibited no significant change versus the control group (-1.7%; 95% CI, -9.4% to +5.9%; P=0.66). Increases in total Mini-Z (scale 10-50; DAX +2.83 [95% CI, +1.28 to +4.37]; Nabla +2.69 [95% CI, +1.14 to +4.23]) and reductions in PTL (scale 0-400; DAX -39.9 [95% CI, -71.9 to -7.9]; Nabla -31.7 [95% CI, -63.8 to +0.4]) and PFI-WE (scale 0-4; DAX -0.32 [95% CI, -0.55 to -0.08]; Nabla -0.23 [95% CI, -0.46 to +0.01]) scores suggest improvement for users of either scribe versus the control. One grade 1 (mild) adverse event was reported, while clinically significant inaccuracies were noted "occasionally" on five-point Likert questions (DAX 2.7 [95% CI, 2.4 to 3.0]; Nabla 2.8 [95% CI, 2.6 to 3.0]).

Conclusions: Nabla reduced time-in-note versus the control. Both DAX and Nabla resulted in potential improvements in burnout, task load, and work exhaustion, but these secondary end point findings need confirmation in larger, multicenter trials. Clinicians reported that performance was similar across the two distinct platforms, and occasional inaccuracies observed in either scribe require ongoing vigilance. (Funded by the University of California, Los Angeles, Department of Medicine and others; ClinicalTrials.gov number, NCT06792890.)
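
The abstract names covariate-constrained randomization but does not describe the procedure. As an illustration only, one common form of the idea can be sketched as follows: draw many candidate 1:1:1 allocations, score each for covariate imbalance, and pick at random from the best-balanced subset. The function names, the imbalance metric (spread of arm means), the number of candidates, and the retained fraction are all assumptions made for this sketch, not the trial's actual method; covariates are assumed to be standardized beforehand.

```python
import random
from statistics import mean


def imbalance(assignment, covariates, n_arms=3):
    """Sum, over covariates, of the gap between the largest and smallest arm mean.

    Hypothetical metric chosen for illustration; covariates should be on a
    comparable scale (e.g. standardized) for the sum to be meaningful.
    """
    total = 0.0
    for j in range(len(covariates[0])):
        arm_means = [
            mean([row[j] for row, arm in zip(covariates, assignment) if arm == k])
            for k in range(n_arms)
        ]
        total += max(arm_means) - min(arm_means)
    return total


def constrained_randomize(covariates, n_arms=3, n_candidates=1000,
                          keep_fraction=0.1, seed=0):
    """Return one allocation drawn at random from the best-balanced candidates."""
    rng = random.Random(seed)
    # Equal allocation across arms (1:1:1 when n_arms == 3).
    base = [i % n_arms for i in range(len(covariates))]
    scored = []
    for _ in range(n_candidates):
        candidate = base[:]
        rng.shuffle(candidate)
        scored.append((imbalance(candidate, covariates, n_arms), candidate))
    # Keep only the allocations with the lowest imbalance scores.
    scored.sort(key=lambda pair: pair[0])
    acceptable = scored[: max(1, int(n_candidates * keep_fraction))]
    return rng.choice(acceptable)[1]
```

Choosing at random within the acceptable set, rather than always taking the single best allocation, keeps the assignment a random draw while still constraining imbalance on the chosen covariates.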

Journal: NEJM AI, 2(12). Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12768499/pdf/
Citations: 0

A Pragmatic Randomized Controlled Trial of Ambient Artificial Intelligence to Improve Health Practitioner Well-Being.
Pub Date: 2025-12-01; Epub Date: 2025-11-26; DOI: 10.1056/aioa2500945
Majid Afshar, Mary Ryan Baumann, Felice Resnik, Josie Hintzke, Anne Gravel Sullivan, Graham Wills, Kayla Lemmon, Jason Dambach, Leigh Ann Mrotek, Mariah Quinn, Kirsten Abramson, Peter Kleinschmidt, Thomas B Brazelton, Margaret A Leaf, Heidi Twedt, David Kunstman, Brian Patterson, Frank Liao, Stacy Rasmussen, Elizabeth S Burnside, Cherodeep Goswami, Joel Gordon

Background: Electronic health record (EHR) documentation is a major contributor to work-related practitioner exhaustion and the interpersonal disengagement known as burnout. Generative artificial intelligence (AI) scribes that passively capture clinical conversations and draft visit notes may alleviate this burden, but evidence remains limited.

Methods: A 24-week, stepped-wedge, individually randomized pragmatic trial was conducted across ambulatory clinics in two states. Sixty-six health care practitioners were randomly assigned to three 6-week sequences of ambient AI. The coprimary outcomes were professional fulfillment and work exhaustion/interpersonal disengagement from the Stanford Professional Fulfillment Index. Secondary measures included time spent on notes, work outside work (WoW), documentation quality with the Provider Documentation Summarization Quality Instrument 9 (PDSQI-9), and billing diagnostic codes reviewed by professional staff coders. Linear mixed models were used for intention-to-treat (ITT) analyses.

Results: A total of 71,487 notes were authored, of which 27,092 (38%) were generated using ambient AI. Ambient AI use produced a significant reduction in work exhaustion/interpersonal disengagement (-0.44 points; 95% confidence interval [CI], -0.62 to -0.25; P<0.001) and a nonsignificant increase in professional fulfillment (+0.14 points; 95% CI, 0.004 to 0.28; P=0.04) on a five-point Likert scale. Among secondary measures, time spent on notes decreased (-0.36 hours per day; 95% CI, -0.55 to -0.17). The reduction in WoW (-0.50 hours per day; 95% CI, -0.90 to -0.09) was sensitive to exclusion of extreme values and was no longer significant after removing the top 3% of daily observations. Diagnostic billing codes improved with ambient AI use (P<0.001). Documentation quality, assessed with the PDSQI-9, showed mean scores ranging from 3.97 to 4.99 across domains on a five-point scale. No drift in software performance was detected.

Conclusions: In a real-world randomized implementation, ambient AI reduced health care practitioners' work exhaustion/interpersonal disengagement but did not significantly increase professional fulfillment. Documentation time decreased without compromising diagnosis, billing compliance, or note quality. (Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; ClinicalTrials.gov number, NCT06517082.)

Journal: NEJM AI, 2(12). Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12858090/pdf/
Citations: 0

A foundation transformer model with self-supervised learning for ECG-based assessment of cardiac and coronary function.
Pub Date: 2025-12-01; Epub Date: 2025-11-26; DOI: 10.1056/aioa2500164
Jonathan B Moody, Alexis Poitrasson-Rivière, Jennifer M Renaud, Tomoe Hagio, Fares Alahdab, Mouaz H Al-Mallah, Michael D Vanderver, Sascha N Goonewardena, Edward P Ficaro, Venkatesh L Murthy

Background: The wide availability of labeled electrocardiogram (ECG) data has driven major advances in artificial intelligence (AI)-based detection of structural and functional cardiac abnormalities and thus ECG-based diagnosis. However, many critical, high-value clinical diagnostic applications, such as assessing myocardial ischemia and coronary microvascular dysfunction, remain underserved due to the limited availability of labeled datasets. We developed a self-supervised ECG foundation model and demonstrate how this approach can overcome this limitation.

Methods: A modified vision transformer model was pretrained using a large database of unlabeled ECG waveforms (MIMIC-IV-ECG, N=800,035). The model was then fine-tuned using smaller databases that included high-quality labels derived from positron emission tomography (N=3,126) and clinical reports (N=13,704) for 12 clinical, demographic, and traditional ECG prediction tasks. Diagnostic accuracy and model generalizability were evaluated across five additional cohorts including the publicly available PTB-XL and UK Biobank databases and labels from cardiac magnetic resonance imaging (MRI) and single photon emission computed tomography (SPECT).

Results: Diagnostic performance varied across tasks, with area under the receiver operating characteristic curve (AUROC) ranging from 0.763 for detection of impaired myocardial flow reserve (MFR < 2) to 0.955 for impaired left ventricular ejection fraction (LVEF < 35%). Self-supervised learning (SSL) pretraining greatly improved diagnostic accuracy in 11 of the 12 prediction tasks compared to conventional de novo supervised training. The model retained strong performance across three external and two internal cross-modality databases, with AUROC ranging from 0.771 for impaired MFR to 0.949 for impaired LVEF.

Conclusion: This versatile ECG foundation model demonstrates that SSL pretraining enhances diagnostic accuracy and generalizability across diverse cardiac diagnostic applications. By enabling effective learning from limited labeled data, this approach supports AI development for complex but clinically critical tasks, such as detecting myocardial ischemia and coronary microvascular dysfunction, where high-quality labels are costly and scarce.
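
For readers less familiar with the AUROC figures reported in the Results above, the metric can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions and have no connection to the study's data or code.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = condition present).
    scores: iterable of model outputs, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; rank-based implementations give the same value more efficiently on large cohorts.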

Journal: NEJM AI, 2(12). Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12724683/pdf/
Citations: 0

Evaluating Translational AI: A Two-Way Moving Target Problem.
Pub Date: 2025-12-01; Epub Date: 2025-11-14; DOI: 10.1056/aip2500705
Richard K Leuchter, William B Turner, David Ouyang

Predictive artificial intelligence models are being deployed across health systems with dangerously inconsistent oversight, creating two critical gaps: a compliance gap, where clinical tools that likely qualify as software as a medical device are implemented without seeking U.S. Food and Drug Administration authorization; and a regulatory gap, where administrative and operational models are deployed without any external review despite their potential to influence care and widen disparities. Given that comprehensive U.S. Food and Drug Administration oversight of all such models is infeasible, the de facto onus of ensuring their safety and efficacy falls on the implementing institutions. However, this imperative for self-governance is undermined by a fundamental and previously unarticulated two-way moving target problem: (1) prior to implementation, concurrent-intervention confounding moves the target as practice and operational changes shift the outcome during the time it takes to develop models; and (2) after implementation, action-induced outcome bias moves the target again when prediction-triggered interventions alter or censor the outcome. Together, these pitfalls render traditional evaluation methods inadequate. The authors argue that health systems must adopt a new default standard for implementing any model that predicts patient outcomes or utilization: short-term randomized deployment with a control group. This approach provides a crucial counterfactual for rigorous, independent assessment of model performance and intervention effectiveness. It offers a practical path forward for institutions to ensure that their artificial intelligence tools are safe, effective, and equitable, thereby building a foundation of trust that is worthy of the patients they serve. (Funded by the National Institutes of Health National Heart, Lung, and Blood Institute.)
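
The proposal above, short-term randomized deployment with a control group, can be pictured with a minimal sketch: randomly split eligible units into a model arm and a control arm, then compare outcome rates between arms. Everything here is an assumption made for illustration, including the function names, the 1:1 split, the binary outcome, and the Wald interval for the risk difference; the article does not prescribe a specific design or analysis.

```python
import math
import random


def randomize_deployment(unit_ids, seed=0):
    """Assign half of the units to the 'model' arm and the rest to 'control'."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {uid: ("model" if i < half else "control")
            for i, uid in enumerate(shuffled)}


def risk_difference(events_model, n_model, events_control, n_control, z=1.96):
    """Difference in event rates (model minus control) with a Wald interval."""
    p_model = events_model / n_model
    p_control = events_control / n_control
    diff = p_model - p_control
    se = math.sqrt(p_model * (1 - p_model) / n_model
                   + p_control * (1 - p_control) / n_control)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 30/100 events with the model active, 40/100 in control.
arms = randomize_deployment(range(200), seed=1)
diff, lower, upper = risk_difference(30, 100, 40, 100)
print(f"risk difference {diff:+.3f} (95% CI {lower:+.3f} to {upper:+.3f})")
```

Because the control arm is never exposed to prediction-triggered interventions, its outcomes supply the counterfactual that the action-induced outcome bias described above would otherwise erase.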

{"title":"Evaluating Translational AI: A Two-Way Moving Target Problem.","authors":"Richard K Leuchter, William B Turner, David Ouyang","doi":"10.1056/aip2500705","DOIUrl":"10.1056/aip2500705","url":null,"abstract":"<p><p>Predictive artificial intelligence models are being deployed across health systems with dangerously inconsistent oversight, creating two critical gaps: a compliance gap, where clinical tools that likely qualify as software as a medical device are implemented without seeking U.S. Food and Drug Administration authorization; and a regulatory gap, where administrative and operational models are deployed without any external review despite their potential to influence care and widen disparities. Given that comprehensive U.S. Food and Drug Administration oversight of all such models is infeasible, the de facto onus of ensuring their safety and efficacy falls on the implementing institutions. However, this imperative for self-governance is undermined by a fundamental and previously unarticulated two-way moving target problem: (1) prior to implementation, concurrent-intervention confounding moves the target as practice and operational changes shift the outcome during the time it takes to develop models; and (2) after implementation, action-induced outcome bias moves the target again when prediction-triggered interventions alter or censor the outcome. Together, these pitfalls render traditional evaluation methods inadequate. The authors argue that health systems must adopt a new default standard for implementing any model that predicts patient outcomes or utilization: short-term randomized deployment with a control group. This approach provides a crucial counterfactual for rigorous, independent assessment of model performance and intervention effectiveness. 
It offers a practical path forward for institutions to ensure that their artificial intelligence tools are safe, effective, and equitable, thereby building a foundation of trust that is worthy of the patients they serve. (Funded by the National Institutes of Health National Heart, Lung, and Blood Institute.).</p>","PeriodicalId":520343,"journal":{"name":"NEJM AI","volume":"2 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12851562/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citation count: 0
Mitigating Limited Data Challenges to Improve Artificial Intelligence Integration in Rare Disease Drug Development.
Pub Date : 2025-11-24 DOI: 10.1056/AIp2500802
Atasi Poddar, Gabriel K Innes, Qi Liu, Anindita Saha, Morgan Hanger, Kelly Franzetti, M Khair ElZarrad, Tala H Fakhouri

The Orphan Drug Act defines a rare disease as a condition affecting fewer than 200,000 people in the United States. However, most rare diseases are categorized as ultrarare or hyper-rare, impacting fewer than 100 individuals worldwide. Developing drugs for these conditions involves multiple challenges, such as geographically dispersed and small patient populations, limited natural history data, and poor disease characterization. Issues related to small patient numbers, scarce natural history information, and clinical heterogeneity within rare diseases can be addressed by various strategies, including using artificial intelligence and advanced analytical methods, leveraging detailed individual-level data, and exploring synthetic data generation to overcome the limitations of small datasets. Moreover, establishing centralized databases and promoting public-private partnerships can help build a more comprehensive repository of available data.

Citation count: 0
Leveraging Artificial Intelligence in Drug and Biological Product Development: An FDA and Clinical Trial Transformation Initiative Workshop Report.
Pub Date : 2025-11-24 DOI: 10.1056/aipc2500801
Atasi Poddar, Marsha Samson, Gabriel K Innes, Qi Liu, Anindita Saha, Morgan Hanger, Kelly Franzetti, M Khair ElZarrad, Tala H Fakhouri

Artificial intelligence (AI) holds immense potential to transform drug development by improving the efficiency and accuracy of key processes across the drug product life cycle. However, the scalable adoption of this technology may be influenced by new and unique challenges. The U.S. Food and Drug Administration collaborated with the Clinical Trial Transformation Initiative to organize a public workshop on Artificial Intelligence in Drug and Biological Product Development in August 2024 with medical product sponsors, technology innovators, academicians, and regulators to discuss guiding principles for the use of AI in drug and biological product development in order to realize its transformative potential. This article synthesizes key insights from the workshop and discusses the emerging need for policy development to enhance the integration of AI in drug and biological product development.

Citation count: 0
A Novel Playbook for Pragmatic Trial Operations to Monitor and Evaluate Ambient Artificial Intelligence in Clinical Practice.
Pub Date : 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1056/aidbp2401267
Majid Afshar, Felice Resnik, Mary Ryan Baumann, Josie Hintzke, Kayla Lemmon, Anne Gravel Sullivan, Tina Shah, Anthony Stordalen, Michael Oberst, Jason Dambach, Leigh Ann Mrotek, Mariah Quinn, Kirsten Abramson, Peter Kleinschmidt, Tom Brazelton, Heidi Twedt, David Kunstman, Graham Wills, John Long, Brian W Patterson, Frank J Liao, Stacy Rasmussen, Elizabeth Burnside, Cherodeep Goswami, Joel E Gordon

Background: Ambient artificial intelligence (AI) offers the potential to reduce documentation burden and improve efficiency through clinical note generation. Widespread adoption, however, remains limited due to challenges in electronic health record (EHR) integration, coding compliance, and real-world evaluation. This study introduces a framework and protocols to design, monitor, and deploy ambient AI within routine care.

Methods: We launched an implementation phase to build technical workflows, establish governance, and inform a pragmatic randomized trial. A bidirectional governance model linked operations and research through multidisciplinary workgroups that incorporated the Systems Engineering Initiative for Patient Safety (SEIPS) framework. Integration into the EHR used Fast Healthcare Interoperability Resources (FHIR), and a real-time dashboard tracked utilization and documentation accuracy. To monitor drift, a difference-in-differences analysis was applied to three process metrics: time in notes, work outside work, and utilization. Audits of International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) compliance were performed using an internally developed large language model (LLM), the validity of which was assessed through correlation with certified professional coders.
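
The difference-in-differences idea mentioned in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the authors' model specification, so this shows only the basic two-group, two-period estimator; the function name, group labels, and all numbers are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Basic two-group, two-period DiD estimate.

    Returns (change in treated group) minus (change in comparison group).
    This is an illustrative estimator, not the study's actual analysis.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical "time in notes" values (minutes); made-up data, not study results.
ambient_pre = [10.0, 12.0, 11.0]
ambient_post = [7.0, 8.0, 9.0]
usual_pre = [10.0, 11.0, 12.0]
usual_post = [10.0, 10.0, 10.0]

did_estimate = difference_in_differences(
    ambient_pre, ambient_post, usual_pre, usual_post
)
print(did_estimate)
```

A negative estimate here would mean the metric fell more in the group using the tool than in the comparison group over the same period. Tracking such an estimate over successive windows is one plausible way to watch for drift, though the abstract does not say how the authors did it.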

Results: Ambient AI utilization, measured as the proportion of eligible clinical notes completed using the system, had a weighted median of 65.4% (interquartile range, 50.6 to 84.0%). Iterative improvement cycles targeted task-specific adoption. A brief workflow issue related to a note template change initially reduced ICD-10 documentation accuracy from 79% (95% confidence interval [CI], 72 to 86%) to 35% (95% CI, 28 to 42%); accuracy returned to baseline after note template redesign and user training. The internally developed LLM coder achieved a strong correlation with professional coders (Pearson's r=0.97). The trial enrolled 66 providers across eight specialties, powered at 90% for the primary outcome of provider well-being.
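
For readers unfamiliar with the agreement statistic reported above, the following Python sketch computes Pearson's r between two sets of scores. The abstract does not describe what was scored or how, so the variable names and values below are hypothetical; the code shows only the standard formula.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-batch compliance scores; made-up values, not study data.
llm_scores = [0.80, 0.35, 0.60, 0.78, 0.50]
coder_scores = [0.79, 0.38, 0.62, 0.75, 0.52]

r = pearson_r(llm_scores, coder_scores)
print(round(r, 3))
```

A high r indicates that the two raters rank and scale cases similarly. It does not by itself show that they agree in absolute terms, which is why agreement studies often report additional measures.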

Conclusions: We provide a publicly available framework and protocols to help safely implement ambient AI in health care. Innovations include an embedded pragmatic trial design, human factors engineering, compliance-driven feedback loops, and real-time monitoring to support deployment, ensuring fidelity before initiation of the clinical trial. (Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; NIH/NCATS UL1TR002737; ClinicalTrials.gov number, NCT06517082.)

Citation count: 0
Video-Based Biomechanical Analysis Captures Disease-Specific Movement Signatures of Different Neuromuscular Diseases.
Pub Date : 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1056/aioa2401137
Parker S Ruth, Scott D Uhlrich, Constance de Monts, Antoine Falisse, Julie Muccini, Sydney Covitz, Shelby Vogt-Domke, John Day, Tina Duong, Scott L Delp

Background: Assessing human movement is essential for diagnosing and monitoring movement-related conditions like neuromuscular disorders. Timed function tests (TFTs) are among the most widespread types of assessments due to their speed and simplicity, but they cannot capture disease-specific movement patterns. Conversely, biomechanical analysis can produce sensitive disease-specific biomarkers, but it is traditionally confined to laboratory settings. Recent advances in smartphone video-based biomechanical analysis enable the quantification of three-dimensional movement with the ease and speed required for clinical settings. However, the potential of this technology to offer more sensitive assessments of human function than TFTs remains untested.

Methods: To compare video-based analysis with TFTs, we collected an observational dataset from 129 individuals: 28 with facioscapulohumeral muscular dystrophy, 58 with myotonic dystrophy, and 43 controls with no diagnosed neuromuscular condition. We used OpenCap, a free open-source software tool, to capture smartphone video-based biomechanics of nine different movements in a median time of 16 minutes per participant. From these recordings, we extracted 34 interpretable movement features. Using these features, we evaluated the ability of video-based biomechanics to reproduce four TFTs (10-meter walk, 10-meter run, timed up-and-go, and 5-times sit-to-stand) while capturing additional disease-specific signatures of movement.

Results: Video-based biomechanical analysis reproduced all four TFTs (r>0.98) with similar test-retest reliability. In addition, video metrics outperformed TFTs at disease classification (P=0.021). Unlike TFTs, video-based biomechanical analysis identified disease-specific signatures of movement, such as differences in gait kinematics, that are not evident in TFTs.

Conclusions: Video-based biomechanical analysis can complement existing functional movement assessments by capturing more sensitive, disease-specific outcomes from human movement. This technology enables digital health solutions for assessing and monitoring motor function, complementing traditional clinical outcome measures to enhance care, management, and clinical trial design for movement-related conditions. (Funded by the Wu Tsai Human Performance Alliance and others.)

Citation count: 0
AI-Powered Diabetes Precision Health: From Data to Action.
Pub Date : 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1056/AIp2500475
Jia Nie, Carol Haft, Ashley Xia, Xujing Wang

Diabetes has become a major public health challenge due to its high prevalence and chronic nature, with many individuals managing the condition for decades. The vast heterogeneity in diabetes necessitates personalized approaches to its prevention, diagnosis, treatment, and prognosis. Recently, the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health convened experts from the fields of diabetes and AI to identify and discuss existing gaps, as well as potentially transformative opportunities and actionable items enabled by recent advancements in AI. One prominent theme that emerged from this discussion was the considerable potential of AI in Diabetes Precision Health, a field that warrants greater attention. The purpose of this article is to describe the opportunities and challenges identified during the workshop and outline potential strategies recommended by workshop attendees to advance this promising field.

Citation count: 0