Pub Date: 2026-02-01 | Epub Date: 2026-01-22 | DOI: 10.1056/aics2500676
Lori Uscher-Pines, Jessica L Sousa, Pushpa Raja, Lynsay Ayer, Ateev Mehrotra, Haiden A Huskamp, Alisa B Busch
Large language model (LLM)-based chatbots are increasingly used for behavioral health support, yet few studies have rigorously evaluated their advice on alcohol misuse. We evaluated seven publicly available chatbots (both general-purpose and behavioral health-focused tools) in responding to questions related to alcohol misuse. Using a fictional case, we simulated longitudinal chatbot interactions over seven days with 25 prompts derived from real-world Reddit posts. Four clinicians, applying an evaluation framework specific to chatbots, independently rated each chatbot's transcript along five domains: empathy, quality of information, usefulness, responsiveness, and scope awareness. Clinicians also assessed secondary dimensions, including stigmatizing language and challenging the user (versus only validating feelings). We generated descriptive statistics on performance and identified examples of problematic output. Across all chatbots, empathy was the highest-rated domain (mean score 4.6/5), while quality of information was the lowest (mean 2.7/5). Overall mean performance scores varied considerably across chatbots, ranging from 2.1 (SD 1.1) to 4.5 (SD 0.8). There were no significant differences in performance between behavioral health-focused and general-purpose chatbots. Every chatbot produced one or more examples of guidance deemed inappropriate, overstated, or inaccurate, although all avoided stigmatizing or judgmental language and supported self-efficacy. Chatbots were perceived to vary widely in their ability to support individuals with alcohol misuse. While generally strong in empathy, they have room for improvement in response quality. As chatbot use expands, users and clinicians should be aware of chatbots' strengths and weaknesses in providing advice on alcohol misuse.
Title: Assessing Generative AI Chatbots for Alcohol Misuse Support: A Longitudinal Simulation Study. NEJM AI 3(2), 2026. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829918/pdf/
Pub Date: 2025-12-01 | Epub Date: 2025-11-26 | DOI: 10.1056/aioa2501000
Paul J Lukac, William Turner, Sitaram Vangala, Aaron T Chin, Joshua Khalili, Ya-Chen Tina Shih, Catherine Sarkisian, Eric M Cheng, John N Mafi
Background: Ambient artificial intelligence (AI) scribes record patient encounters and rapidly generate visit notes, representing a promising solution to documentation burden and physician burnout. However, the scribes' impacts have not been examined in randomized clinical trials.
Methods: In this parallel three-group pragmatic randomized clinical trial, 238 outpatient physicians representing 14 specialties were assigned 1:1:1, via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week), to one of two AI scribe applications (Microsoft Dragon Ambient eXperience [DAX] Copilot or Nabla) or to a usual-care control group from November 4, 2024, to January 3, 2025. The primary outcome was the change from baseline in log-transformed writing time-in-note. Secondary end points, measured by surveys, included the Mini-Z 2.0, a four-item physician task load (PTL) measure, and Professional Fulfillment Index - Work Exhaustion (PFI-WE) scores to evaluate aspects of burnout, work environment, and stress, as well as targeted questions addressing safety, accuracy, and usability.
Results: DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% decrease in time-in-note versus the control group (95% confidence interval [CI], -17.2% to -1.8%; P=0.02), whereas DAX users exhibited no significant change versus the control group (-1.7%; 95% CI, -9.4% to +5.9%; P=0.66). Increases in total Mini-Z scores (scale 10-50; DAX +2.83 [95% CI, +1.28 to +4.37]; Nabla +2.69 [95% CI, +1.14 to +4.23]) and reductions in PTL (scale 0-400; DAX -39.9 [95% CI, -71.9 to -7.9]; Nabla -31.7 [95% CI, -63.8 to +0.4]) and PFI-WE (scale 0-4; DAX -0.32 [95% CI, -0.55 to -0.08]; Nabla -0.23 [95% CI, -0.46 to +0.01]) scores suggest improvement for users of either scribe versus the control. One grade 1 (mild) adverse event was reported, while clinically significant inaccuracies were noted "occasionally" on five-point Likert questions (DAX 2.7 [95% CI, 2.4 to 3.0]; Nabla 2.8 [95% CI, 2.6 to 3.0]).
Conclusions: Nabla reduced time-in-note versus the control. Both DAX and Nabla resulted in potential improvements in burnout, task load, and work exhaustion, but these secondary end point findings need confirmation in larger, multicenter trials. Clinicians reported that performance was similar across the two distinct platforms, and occasional inaccuracies observed in either scribe require ongoing vigilance. (Funded by the University of California, Los Angeles, Department of Medicine and others; ClinicalTrials.gov number, NCT06792890.).
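The percent effects on time-in-note reported above come from a model of log-transformed time, where a coefficient beta on the log scale corresponds to a percent change of exp(beta) - 1. A minimal sketch of that conversion (the coefficient value is back-derived from the reported -9.5% for illustration, not taken from the trial):

```python
import math

# Hypothetical log-scale coefficient, chosen so the implied percent
# change matches the reported -9.5% for Nabla.
beta = math.log(1 - 0.095)  # about -0.0998

def pct_change(b: float) -> float:
    """Convert a log-outcome regression coefficient to a percent change."""
    return (math.exp(b) - 1) * 100

print(round(pct_change(beta), 1))  # -9.5
```

The same conversion applies to the confidence-interval endpoints, which is why intervals on the percent scale are asymmetric around the point estimate.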
Title: Ambient AI Scribes in Clinical Practice: A Randomized Trial. NEJM AI 2(12), 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12768499/pdf/
Pub Date: 2025-12-01 | Epub Date: 2025-11-26 | DOI: 10.1056/aioa2500945
Majid Afshar, Mary Ryan Baumann, Felice Resnik, Josie Hintzke, Anne Gravel Sullivan, Graham Wills, Kayla Lemmon, Jason Dambach, Leigh Ann Mrotek, Mariah Quinn, Kirsten Abramson, Peter Kleinschmidt, Thomas B Brazelton, Margaret A Leaf, Heidi Twedt, David Kunstman, Brian Patterson, Frank Liao, Stacy Rasmussen, Elizabeth S Burnside, Cherodeep Goswami, Joel Gordon
Background: Electronic health record (EHR) documentation is a major contributor to work-related practitioner exhaustion and the interpersonal disengagement known as burnout. Generative artificial intelligence (AI) scribes that passively capture clinical conversations and draft visit notes may alleviate this burden, but evidence remains limited.
Methods: A 24-week, stepped-wedge, individually randomized pragmatic trial was conducted across ambulatory clinics in two states. Sixty-six health care practitioners were randomly assigned to one of three 6-week sequences of ambient AI adoption. The coprimary outcomes were professional fulfillment and work exhaustion/interpersonal disengagement from the Stanford Professional Fulfillment Index. Secondary measures included time spent on notes, work outside work (WoW), documentation quality assessed with the Provider Documentation Summarization Quality Instrument 9 (PDSQI-9), and billing diagnostic codes reviewed by professional staff coders. Linear mixed models were used for intention-to-treat (ITT) analyses.
Results: A total of 71,487 notes were authored, of which 27,092 (38%) were generated using ambient AI. Ambient AI use was associated with a significant reduction in work exhaustion/interpersonal disengagement (-0.44 points; 95% confidence interval [CI], -0.62 to -0.25; P<0.001) and a nonsignificant increase in professional fulfillment (+0.14 points; 95% CI, 0.004 to 0.28; P=0.04) on a five-point Likert scale. Among secondary measures, time spent on notes decreased (-0.36 hours per day; 95% CI, -0.55 to -0.17). The reduction in WoW (-0.50 hours per day; 95% CI, -0.90 to -0.09) was sensitive to the exclusion of extreme values and was no longer significant after removing the top 3% of daily observations. Diagnostic billing codes improved with ambient AI use (P<0.001). Documentation quality, assessed with the PDSQI-9, showed mean scores ranging from 3.97 to 4.99 across domains on a five-point scale. No drift in software performance was detected.
Conclusions: In a real-world randomized implementation, ambient AI reduced health care practitioners' work exhaustion/interpersonal disengagement but did not significantly increase professional fulfillment. Documentation time decreased without compromising diagnosis, billing compliance, or note quality. (Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; ClinicalTrials.gov number, NCT06517082.).
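The stepped-wedge analysis described above can be illustrated with a toy simulation. Everything below (practitioner count, crossover schedule, effect sizes, noise levels) is synthetic and chosen only to mirror the reported -0.44-point estimate; the fixed-effects OLS fit is a deliberate simplification of the linear mixed model the trial actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stepped-wedge layout: 30 practitioners over 24 weeks,
# with three sequences crossing over to ambient AI at weeks 6, 12, 18.
n_pract, n_weeks = 30, 24
cross = np.repeat([6, 12, 18], n_pract // 3)        # crossover week per practitioner
person = np.repeat(np.arange(n_pract), n_weeks)     # practitioner index per observation
week = np.tile(np.arange(n_weeks), n_pract)         # calendar week per observation
treated = (week >= cross[person]).astype(float)     # 1 once ambient AI is switched on

true_effect = -0.44                                 # matches the reported point estimate
y = (3.0
     + 0.2 * rng.standard_normal(n_pract)[person]   # practitioner-level intercepts
     - 0.01 * week                                  # secular time trend
     + true_effect * treated
     + 0.3 * rng.standard_normal(n_pract * n_weeks))  # residual noise

# OLS with practitioner fixed effects and a linear week term, so the
# treatment effect is separated from who a practitioner is and when it is.
dummies = (person[:, None] == np.arange(n_pract)).astype(float)
X = np.column_stack([treated, week, dummies])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(beta[0], 2))  # estimated treatment effect, near the simulated -0.44
```

Modeling the time trend matters here: in a stepped wedge everyone is untreated early and treated late, so without a time term any secular drift would be misattributed to the intervention.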
Title: A Pragmatic Randomized Controlled Trial of Ambient Artificial Intelligence to Improve Health Practitioner Well-Being. NEJM AI 2(12), 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12858090/pdf/
Pub Date: 2025-12-01 | Epub Date: 2025-11-26 | DOI: 10.1056/aioa2500164
Jonathan B Moody, Alexis Poitrasson-Rivière, Jennifer M Renaud, Tomoe Hagio, Fares Alahdab, Mouaz H Al-Mallah, Michael D Vanderver, Sascha N Goonewardena, Edward P Ficaro, Venkatesh L Murthy
Background: The wide availability of labeled electrocardiogram (ECG) data has driven major advances in artificial intelligence (AI)-based detection of structural and functional cardiac abnormalities, and thus in ECG-based diagnosis. However, many critical, high-value clinical diagnostic applications, such as assessing myocardial ischemia and coronary microvascular dysfunction, remain underserved because of the limited availability of labeled datasets. We developed a self-supervised ECG foundation model and demonstrate how this approach can overcome this limitation.
Methods: A modified vision transformer model was pretrained using a large database of unlabeled ECG waveforms (MIMIC-IV-ECG, N=800,035). The model was then fine-tuned using smaller databases that included high-quality labels derived from positron emission tomography (N=3,126) and clinical reports (N=13,704) for 12 clinical, demographic, and traditional ECG prediction tasks. Diagnostic accuracy and model generalizability were evaluated across five additional cohorts including the publicly available PTB-XL and UK Biobank databases and labels from cardiac magnetic resonance imaging (MRI) and single photon emission computed tomography (SPECT).
Results: Diagnostic performance varied across tasks, with the area under the receiver operating characteristic curve (AUROC) ranging from 0.763 for detection of impaired myocardial flow reserve (MFR <2) to 0.955 for impaired left ventricular ejection fraction (LVEF <35%). Self-supervised learning (SSL) pretraining substantially improved diagnostic accuracy in 11 of the 12 prediction tasks compared with conventional de novo supervised training. The model retained strong performance across three external and two internal cross-modality databases, with AUROC ranging from 0.771 for impaired MFR to 0.949 for impaired LVEF.
Conclusion: This versatile ECG foundation model demonstrates that SSL pretraining enhances diagnostic accuracy and generalizability across diverse cardiac diagnostic applications. By enabling effective learning from limited labeled data, this approach supports AI development for complex but clinically critical tasks, such as detecting myocardial ischemia and coronary microvascular dysfunction, where high-quality labels are costly and scarce.
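AUROC, the metric quoted throughout the results above, is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative one (ties counting half). A dependency-free sketch of that definition (labels and scores below are made up for illustration, not study data):

```python
def auroc(labels, scores):
    """Pairwise-comparison AUROC: fraction of (positive, negative) pairs
    where the positive case is scored higher; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two negatives, two positives.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUROC of 0.5 corresponds to random scoring and 1.0 to perfect separation, which is why values like 0.955 for impaired LVEF indicate strong discrimination.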
Title: A foundation transformer model with self-supervised learning for ECG-based assessment of cardiac and coronary function. NEJM AI 2(12), 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12724683/pdf/
Pub Date: 2025-12-01 | Epub Date: 2025-11-14 | DOI: 10.1056/aip2500705
Richard K Leuchter, William B Turner, David Ouyang
Predictive artificial intelligence models are being deployed across health systems with dangerously inconsistent oversight, creating two critical gaps: a compliance gap, where clinical tools that likely qualify as software as a medical device are implemented without seeking U.S. Food and Drug Administration authorization; and a regulatory gap, where administrative and operational models are deployed without any external review despite their potential to influence care and widen disparities. Given that comprehensive U.S. Food and Drug Administration oversight of all such models is infeasible, the de facto onus of ensuring their safety and efficacy falls on the implementing institutions. However, this imperative for self-governance is undermined by a fundamental and previously unarticulated two-way moving target problem: (1) prior to implementation, concurrent-intervention confounding moves the target as practice and operational changes shift the outcome during the time it takes to develop models; and (2) after implementation, action-induced outcome bias moves the target again when prediction-triggered interventions alter or censor the outcome. Together, these pitfalls render traditional evaluation methods inadequate. The authors argue that health systems must adopt a new default standard for implementing any model that predicts patient outcomes or utilization: short-term randomized deployment with a control group. This approach provides a crucial counterfactual for rigorous, independent assessment of model performance and intervention effectiveness. It offers a practical path forward for institutions to ensure that their artificial intelligence tools are safe, effective, and equitable, thereby building a foundation of trust that is worthy of the patients they serve. (Funded by the National Institutes of Health National Heart, Lung, and Blood Institute.).
Title: Evaluating Translational AI: A Two-Way Moving Target Problem. NEJM AI 2(12), 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12851562/pdf/
Atasi Poddar, Gabriel K Innes, Qi Liu, Anindita Saha, Morgan Hanger, Kelly Franzetti, M Khair ElZarrad, Tala H Fakhouri
The Orphan Drug Act defines a rare disease as a condition affecting fewer than 200,000 people in the United States. However, most rare diseases are categorized as ultrarare or hyper-rare, impacting fewer than 100 individuals worldwide. Developing drugs for these conditions involves multiple challenges, such as geographically dispersed and small patient populations, limited natural history data, and poor disease characterization. Issues related to small patient numbers, scarce natural history information, and clinical heterogeneity within rare diseases can be addressed by various strategies, including using artificial intelligence and advanced analytical methods, leveraging detailed individual-level data, and exploring synthetic data generation to overcome the limitations of small datasets. Moreover, establishing centralized databases and promoting public-private partnerships can help build a more comprehensive repository of available data.
Title: Mitigating Limited Data Challenges to Improve Artificial Intelligence Integration in Rare Disease Drug Development. NEJM AI 2(12). Pub Date: 2025-11-24 | DOI: 10.1056/AIp2500802. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12690552/pdf/
Atasi Poddar, Marsha Samson, Gabriel K Innes, Qi Liu, Anindita Saha, Morgan Hanger, Kelly Franzetti, M Khair ElZarrad, Tala H Fakhouri
Artificial intelligence (AI) holds immense potential to transform drug development by improving the efficiency and accuracy of key processes across the drug product life cycle. However, the scalable adoption of this technology may be influenced by new and unique challenges. In August 2024, the U.S. Food and Drug Administration collaborated with the Clinical Trial Transformation Initiative to organize a public workshop on Artificial Intelligence in Drug and Biological Product Development, bringing together medical product sponsors, technology innovators, academicians, and regulators to discuss guiding principles for the use of AI in drug and biological product development and to realize its transformative potential. This article synthesizes key insights from the workshop and discusses the emerging need for policy development to enhance the integration of AI in drug and biological product development.
{"title":"Leveraging Artificial Intelligence in Drug and Biological Product Development: An FDA and Clinical Trial Transformation Initiative Workshop Report.","authors":"Atasi Poddar, Marsha Samson, Gabriel K Innes, Qi Liu, Anindita Saha, Morgan Hanger, Kelly Franzetti, M Khair ElZarrad, Tala H Fakhouri","doi":"10.1056/aipc2500801","DOIUrl":"10.1056/aipc2500801","url":null,"abstract":"<p><p>Artificial intelligence (AI) holds immense potential to transform drug development by improving the efficiency and accuracy of key processes across the drug product life cycle. However, the scalable adoption of this technology may be influenced by new and unique challenges. The U.S. Food and Drug Administration collaborated with the Clinical Trial Transformation Initiative to organize a public workshop on Artificial Intelligence in Drug and Biological Product Development in August 2024 with medical product sponsors, technology innovators, academicians, and regulators to discuss guiding principles for the use of AI in drug and biological product development to realize its transformative potential. This article synthesizes key insights from the workshop and discusses the emerging need for policy development to enhance the integration of AI in drug and biological product development.</p>","PeriodicalId":520343,"journal":{"name":"NEJM AI","volume":"2 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12690500/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145746516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-01Epub Date: 2025-08-28DOI: 10.1056/aidbp2401267
Majid Afshar, Felice Resnik, Mary Ryan Baumann, Josie Hintzke, Kayla Lemmon, Anne Gravel Sullivan, Tina Shah, Anthony Stordalen, Michael Oberst, Jason Dambach, Leigh Ann Mrotek, Mariah Quinn, Kirsten Abramson, Peter Kleinschmidt, Tom Brazelton, Heidi Twedt, David Kunstman, Graham Wills, John Long, Brian W Patterson, Frank J Liao, Stacy Rasmussen, Elizabeth Burnside, Cherodeep Goswami, Joel E Gordon
Background: Ambient artificial intelligence (AI) offers the potential to reduce documentation burden and improve efficiency through clinical note generation. Widespread adoption, however, remains limited due to challenges in electronic health record (EHR) integration, coding compliance, and real-world evaluation. This study introduces a framework and protocols to design, monitor, and deploy ambient AI within routine care.
Methods: We launched an implementation phase to build technical workflows, establish governance, and inform a pragmatic randomized trial. A bidirectional governance model linked operations and research through multidisciplinary workgroups that incorporated the Systems Engineering Initiative for Patient Safety (SEIPS) framework. Integration into the EHR used Fast Healthcare Interoperability Resources (FHIR), and a real-time dashboard tracked utilization and documentation accuracy. To monitor drift, a difference-in-differences analysis was applied to three process metrics: time in notes, work outside work, and utilization. Audits of International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) compliance were performed using an internally developed large language model (LLM), the validity of which was assessed through correlation with certified professional coders.
Results: Ambient AI utilization, measured as the proportion of eligible clinical notes completed using the system, had a weighted median of 65.4% (interquartile range, 50.6 to 84.0%). Iterative improvement cycles targeted task-specific adoption. A brief workflow issue related to a note template change initially reduced ICD-10 documentation accuracy from 79% (95% confidence interval [CI], 72 to 86%) to 35% (95% CI, 28 to 42%); accuracy returned to baseline after note template redesign and user training. The internally developed LLM coder achieved a strong correlation with professional coders (Pearson's r=0.97). The trial enrolled 66 providers across eight specialties, powered at 90% for the primary outcome of provider well-being.
Conclusions: We provide a publicly available framework and protocols to help safely implement ambient AI in health care. Innovations include an embedded pragmatic trial design, human factors engineering, compliance-driven feedback loops, and real-time monitoring to support deployment, ensuring fidelity before initiation of the clinical trial. (Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; NIH/NCATS UL1TR002737; ClinicalTrials.gov number, NCT06517082.).
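The drift monitoring described in the Methods rests on a difference-in-differences comparison of process metrics (time in notes, work outside work, utilization) before and after deployment. As a minimal illustration of that estimator — with invented column names and data, not the study's actual pipeline — the core calculation can be sketched as:

```python
# Minimal difference-in-differences sketch for a process metric such as
# "time in notes". Data and column names are hypothetical, for illustration only.
import pandas as pd

def did_estimate(df: pd.DataFrame) -> float:
    """DiD = (treated post - treated pre) - (control post - control pre)."""
    means = df.groupby(["treated", "post"])["metric"].mean()
    return (means[(1, 1)] - means[(1, 0)]) - (means[(0, 1)] - means[(0, 0)])

df = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
    "metric":  [60, 62, 50, 52, 58, 60, 57, 59],  # e.g., minutes in notes
})
print(did_estimate(df))  # → -9.0: treated group fell ~9 min more than controls
```

The subtraction of the control group's pre/post change is what separates a deployment effect from background drift affecting all providers.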
{"title":"A Novel Playbook for Pragmatic Trial Operations to Monitor and Evaluate Ambient Artificial Intelligence in Clinical Practice.","authors":"Majid Afshar, Felice Resnik, Mary Ryan Baumann, Josie Hintzke, Kayla Lemmon, Anne Gravel Sullivan, Tina Shah, Anthony Stordalen, Michael Oberst, Jason Dambach, Leigh Ann Mrotek, Mariah Quinn, Kirsten Abramson, Peter Kleinschmidt, Tom Brazelton, Heidi Twedt, David Kunstman, Graham Wills, John Long, Brian W Patterson, Frank J Liao, Stacy Rasmussen, Elizabeth Burnside, Cherodeep Goswami, Joel E Gordon","doi":"10.1056/aidbp2401267","DOIUrl":"10.1056/aidbp2401267","url":null,"abstract":"<p><strong>Background: </strong>Ambient artificial intelligence (AI) offers the potential to reduce documentation burden and improve efficiency through clinical note generation. Widespread adoption, however, remains limited due to challenges in electronic health record (EHR) integration, coding compliance, and real-world evaluation. This study introduces a framework and protocols to design, monitor, and deploy ambient AI within routine care.</p><p><strong>Methods: </strong>We launched an implementation phase to build technical workflows, establish governance, and inform a pragmatic randomized trial. A bidirectional governance model linked operations and research through multidisciplinary workgroups that incorporated the Systems Engineering Initiative for Patient Safety (SEIPS) framework. Integration into the EHR used Fast Healthcare Interoperability Resources (FHIR), and a real-time dashboard tracked utilization and documentation accuracy. To monitor drift, a difference-in-differences analysis was applied to three process metrics: time in notes, work outside work, and utilization. 
Audits of <i>International Statistical Classification of Diseases and Related Health Problems</i>, Tenth Revision (ICD-10) compliance were performed using an internally developed large language model (LLM), the validity of which was assessed through correlation with certified professional coders.</p><p><strong>Results: </strong>Ambient AI utilization, measured as the proportion of eligible clinical notes completed using the system, had a weighted median of 65.4% (interquartile range, 50.6 to 84.0%). Iterative improvement cycles targeted task-specific adoption. A brief workflow issue related to a note template change initially reduced ICD-10 documentation accuracy from 79% (95% confidence interval [CI], 72 to 86%) to 35% (95% CI, 28 to 42%); accuracy returned to baseline after note template redesign and user training. The internally developed LLM coder achieved a strong correlation with professional coders (Pearson's r=0.97). The trial enrolled 66 providers across eight specialties, powered at 90% for the primary outcome of provider well-being.</p><p><strong>Conclusions: </strong>We provide a publicly available framework and protocols to help safely implement ambient AI in health care. Innovations include an embedded pragmatic trial design, human factors engineering, compliance-driven feedback loops, and real-time monitoring to support deployment, ensuring fidelity before initiation of the clinical trial. 
(Funded by the University of Wisconsin Hospital and Clinics and the National Institutes of Health Clinical and Translational Science Award; NIH/NCATS UL1TR002737; ClinicalTrials.gov number, NCT06517082.).</p>","PeriodicalId":520343,"journal":{"name":"NEJM AI","volume":"2 9","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12435388/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145077149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-01Epub Date: 2025-08-28DOI: 10.1056/aioa2401137
Parker S Ruth, Scott D Uhlrich, Constance de Monts, Antoine Falisse, Julie Muccini, Sydney Covitz, Shelby Vogt-Domke, John Day, Tina Duong, Scott L Delp
Background: Assessing human movement is essential for diagnosing and monitoring movement-related conditions like neuromuscular disorders. Timed function tests (TFTs) are among the most widespread types of assessments due to their speed and simplicity, but they cannot capture disease-specific movement patterns. Conversely, biomechanical analysis can produce sensitive disease-specific biomarkers, but it is traditionally confined to laboratory settings. Recent advances in smartphone video-based biomechanical analysis enable the quantification of three-dimensional movement with the ease and speed required for clinical settings. However, the potential of this technology to offer more sensitive assessments of human function than TFTs remains untested.
Methods: To compare video-based analysis with TFTs, we collected an observational dataset from 129 individuals: 28 with facioscapulohumeral muscular dystrophy, 58 with myotonic dystrophy, and 43 controls with no diagnosed neuromuscular condition. We used OpenCap, a free open-source software tool, to capture smartphone video-based biomechanics of nine different movements in a median time of 16 minutes per participant. From these recordings, we extracted 34 interpretable movement features. Using these features, we evaluated the ability of video-based biomechanics to reproduce four TFTs (10-meter walk, 10-meter run, timed up-and-go, and 5-times sit-to-stand) while capturing additional disease-specific signatures of movement.
Results: Video-based biomechanical analysis reproduced all four TFTs (r>0.98) with similar test-retest reliability. In addition, video metrics outperformed TFTs at disease classification (P=0.021). Unlike TFTs, video-based biomechanical analysis identified disease-specific signatures of movement, such as differences in gait kinematics, that are not evident in TFTs.
Conclusions: Video-based biomechanical analysis can complement existing functional movement assessments by capturing more sensitive, disease-specific outcomes from human movement. This technology enables digital health solutions for assessing and monitoring motor function, complementing traditional clinical outcome measures to enhance care, management, and clinical trial design for movement-related conditions. (Funded by the Wu Tsai Human Performance Alliance and others.).
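The Results report that video-derived metrics reproduced all four timed function tests with r>0.98 — an agreement check that amounts to a Pearson correlation between stopwatch times and video-derived times for the same trials. A minimal sketch of that check, with invented numbers rather than the study's data:

```python
# Pearson correlation between stopwatch-timed and video-derived test times.
# Values below are hypothetical; the study reported r > 0.98 across four TFTs.
import statistics

def pearson_r(x: list[float], y: list[float]) -> float:
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

stopwatch = [7.1, 8.4, 6.9, 9.2, 7.8]  # e.g., 10-meter walk times (s)
video     = [7.0, 8.5, 6.8, 9.3, 7.9]  # same trials, video-derived
print(round(pearson_r(stopwatch, video), 3))  # → 0.998
```

A high r on such paired measurements is what licenses the claim that the video pipeline "reproduces" a timed test while still yielding the richer kinematic features the abstract describes.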
{"title":"Video-Based Biomechanical Analysis Captures Disease-Specific Movement Signatures of Different Neuromuscular Diseases.","authors":"Parker S Ruth, Scott D Uhlrich, Constance de Monts, Antoine Falisse, Julie Muccini, Sydney Covitz, Shelby Vogt-Domke, John Day, Tina Duong, Scott L Delp","doi":"10.1056/aioa2401137","DOIUrl":"10.1056/aioa2401137","url":null,"abstract":"<p><strong>Background: </strong>Assessing human movement is essential for diagnosing and monitoring movement-related conditions like neuromuscular disorders. Timed function tests (TFTs) are among the most widespread types of assessments due to their speed and simplicity, but they cannot capture disease-specific movement patterns. Conversely, biomechanical analysis can produce sensitive disease-specific biomarkers, but it is traditionally confined to laboratory settings. Recent advances in smartphone video-based biomechanical analysis enable the quantification of three-dimensional movement with the ease and speed required for clinical settings. However, the potential of this technology to offer more sensitive assessments of human function than TFTs remains untested.</p><p><strong>Methods: </strong>To compare video-based analysis with TFTs, we collected an observational dataset from 129 individuals: 28 with facioscapulohumeral muscular dystrophy, 58 with myotonic dystrophy, and 43 controls with no diagnosed neuromuscular condition. We used OpenCap, a free open-source software tool, to capture smartphone video-based biomechanics of nine different movements in a median time of 16 minutes per participant. From these recordings, we extracted 34 interpretable movement features. 
Using these features, we evaluated the ability of video-based biomechanics to reproduce four TFTs (10-meter walk, 10-meter run, timed up-and-go, and 5-times sit-to-stand) while capturing additional disease-specific signatures of movement.</p><p><strong>Results: </strong>Video-based biomechanical analysis reproduced all four TFTs (r>0.98) with similar test-retest reliability. In addition, video metrics outperformed TFTs at disease classification (P=0.021). Unlike TFTs, video-based biomechanical analysis identified disease-specific signatures of movement, such as differences in gait kinematics, that are not evident in TFTs.</p><p><strong>Conclusions: </strong>Video-based biomechanical analysis can complement existing functional movement assessments by capturing more sensitive, disease-specific outcomes from human movement. This technology enables digital health solutions for assessing and monitoring motor function, complementing traditional clinical outcome measures to enhance care, management, and clinical trial design for movement-related conditions. (Funded by the Wu Tsai Human Performance Alliance and others.).</p>","PeriodicalId":520343,"journal":{"name":"NEJM AI","volume":"2 9","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416922/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145031669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-01Epub Date: 2025-08-28DOI: 10.1056/AIp2500475
Jia Nie, Carol Haft, Ashley Xia, Xujing Wang
Diabetes has become a major public health challenge due to its high prevalence and chronic nature, with many individuals managing the condition for decades. The vast heterogeneity in diabetes necessitates personalized approaches to its prevention, diagnosis, treatment, and prognosis. Recently, the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health convened experts from the fields of diabetes and artificial intelligence (AI) to identify and discuss existing gaps, as well as potentially transformative opportunities and actionable items enabled by recent advancements in AI. One prominent theme that emerged from this discussion was the considerable potential of AI in Diabetes Precision Health, a field that warrants greater attention. The purpose of this article is to describe the opportunities and challenges identified during the workshop and outline potential strategies recommended by workshop attendees to advance this promising field.
{"title":"AI-Powered Diabetes Precision Health: From Data to Action.","authors":"Jia Nie, Carol Haft, Ashley Xia, Xujing Wang","doi":"10.1056/AIp2500475","DOIUrl":"10.1056/AIp2500475","url":null,"abstract":"<p><p>Diabetes has become a major public health challenge due to its high prevalence and chronic nature, with many individuals managing the condition for decades. The vast heterogeneity in diabetes necessitates personalized approaches to its prevention, diagnosis, treatment, and prognosis. Recently, the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health convened experts from the fields of diabetes and artificial intelligence (AI) to identify and discuss existing gaps, as well as potentially transformative opportunities and actionable items enabled by recent advancements in AI. One prominent theme that emerged from this discussion was the considerable potential of AI in Diabetes Precision Health, a field that warrants greater attention. The purpose of this article is to describe the opportunities and challenges identified during the workshop and outline potential strategies recommended by workshop attendees to advance this promising field.</p>","PeriodicalId":520343,"journal":{"name":"NEJM AI","volume":"2 9","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12412894/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145017125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}