James E Tcheng, David Finney, Keith Boone, Samit P Desai, David A Pyke, Nandan Shanbhag, Ganesan Srinivasan, Nick Ramsing, Mark D Kelemen
Objective: We conducted the Clinical Registry Extraction and Data Submission (CREDS) project to evaluate the readiness of HL7 Fast Healthcare Interoperability Resources (FHIR) for provisioning data from health information systems for the American College of Cardiology Cardiac Catheterization Percutaneous Coronary Intervention (CathPCI) Registry.
Materials and methods: The CREDS project had 3 workstreams: (1) evaluation of the readiness of clinical documentation for data transforms, (2) modeling of a FHIR-based clinical workflow for registry data submission, and (3) development and demonstration of a CREDS FHIR implementation for registry data submission.
Results: Of the 344 data concepts comprising the CathPCI Registry, only 111 (32%) were sufficiently discrete to be listed in the CathPCI Data Dictionary with a terminology mapping. Cardiologist informaticians identified an additional 42 concepts suitable for provisioning via a FHIR payload. The resulting notional workflow combined FHIR-based data assembly with manual chart abstraction of compound, summative, and complex clinical concepts. A CathPCI FHIR StructureDefinition artifact was authored, incorporated into a CREDS FHIR Implementation Guide, and balloted to Standard for Trial Use status.
Discussion: CREDS demonstrated both potential and limitations for using FHIR for registry data submission. The largest technical impediment was the volume of code (>11 000 lines) for the FHIR StructureDefinition. Lack of regularized clinical vocabularies, reliance of registries on complex clinical concepts, and absence of FHIR infrastructure must be overcome before CREDS can be used at scale.
Conclusion: CREDS demonstrated proof-of-concept FHIR-based provisioning of clinical data for registry submission. All artifacts are open source to inform others with similar interests.
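The FHIR-based data assembly the abstract describes can be illustrated with a minimal sketch. This is not taken from the CathPCI StructureDefinition or the CREDS Implementation Guide; the function name, the LOINC code, and the patient reference are illustrative assumptions showing the general shape of a FHIR R4 Observation in a submission payload:

```python
# Hedged sketch: assembling a FHIR R4 Observation resource as a plain
# Python dict. The code "2160-0" (serum creatinine) and the patient id
# are placeholders, not CathPCI Registry terminology mappings.

def make_observation(patient_id: str, loinc_code: str, display: str,
                     value: float, unit: str) -> dict:
    """Build a minimal FHIR R4 Observation resource."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{"system": "http://loinc.org",
                        "code": loinc_code,
                        "display": display}]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": value, "unit": unit,
                          "system": "http://unitsofmeasure.org"},
    }

obs = make_observation("123", "2160-0", "Creatinine", 1.1, "mg/dL")
print(obs["resourceType"], obs["code"]["coding"][0]["code"])
```

A real registry payload would bundle many such resources and validate them against the balloted StructureDefinition profile rather than constructing dicts by hand.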
{"title":"Evaluating the potential of fast healthcare interoperability resources for clinical registry data submission.","authors":"James E Tcheng, David Finney, Keith Boone, Samit P Desai, David A Pyke, Nandan Shanbhag, Ganesan Srinivasan, Nick Ramsing, Mark D Kelemen","doi":"10.1093/jamia/ocag029","DOIUrl":"https://doi.org/10.1093/jamia/ocag029","url":null,"abstract":"<p><strong>Objective: </strong>We conducted the Clinical Registry Extraction and Data Submission (CREDS) project to evaluate the readiness of HL7 Fast Healthcare Interoperability Resources (FHIR) for provisioning data from health information systems for the American College of Cardiology Cardiac Catheterization Percutaneous Coronary Intervention (CathPCI) Registry.</p><p><strong>Materials and methods: </strong>The CREDS project had 3 workstreams: (1) evaluation of the readiness of clinical documentation for data transforms, (2) modeling of a FHIR-based clinical workflow for registry data submission, and (3) development and demonstration of a CREDS FHIR implementation for registry data submission.</p><p><strong>Results: </strong>Of the 344 data concepts comprising the CathPCI Registry, only 111 (32%) were sufficiently discrete to be listed in the CathPCI Data Dictionary with a terminology mapping. Cardiologist informaticians identified an additional 42 concepts suitable for provisioning via a FHIR payload. The resulting notional workflow combined FHIR-based data assembly with manual chart abstraction of compound, summative, and complex clinical concepts. A CathPCI FHIR StructureDefinition artifact was authored, incorporated into a CREDS FHIR Implementation Guide, and balloted to Standard for Trial Use status.</p><p><strong>Discussion: </strong>CREDS demonstrated both potential and limitations for using FHIR for registry data submission. The largest technical impediment was the volume of code (>11 000 lines) for the FHIR StructureDefinition. 
Lack of regularized clinical vocabularies, reliance of registries on complex clinical concepts, and absence of FHIR infrastructure must be overcome before CREDS can be used at scale.</p><p><strong>Conclusion: </strong>CREDS demonstrated proof-of-concept FHIR-based provisioning of clinical data for registry submission. All artifacts are open source to inform others with similar interests.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147391600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sarah Conderino, H Lester Kirchner, Lorna E Thorpe, Jasmin Divers, Annemarie G Hirsch, Cara M Nordberg, Brian S Schwartz, Lu Zhang, Bo Cai, Caroline Rudisill, Jihad S Obeid, Angela Liese, Katie S Allen, Brian E Dixon, Tessa Crume, Dana Dabelea, Shawna Burgett, Anna Bellatorre, Hui Shao, Jiang Bian, Yi Guo, Sarah Bost, Tianchen Lyu, Kristi Reynolds, Matthew T Mefford, Hui Zhou, Matt Zhou, Eva Lustigova, Levon H Utidjian, Mitchell Maltenfort, Manmohan Kamboj, Eneida A Mendonca, Patrick Hanley, Ibrahim Zaganjor, Meda E Pavkov, Marc Rosenman, Andrea R Titus
Objective: We discuss implications of potential ascertainment biases for studies examining diabetes risk following SARS-CoV-2 infection using electronic health records (EHRs). We quantitatively explore sensitivity of results to misclassification of COVID-19 status using data from the U.S.-based Diabetes in Children, Adolescents and Young Adults (DiCAYA) Network on children (≤17 years) and young adults (18-44 years).
Materials and methods: In our retrospective case study from the DiCAYA Network, SARS-CoV-2 was identified using labs and diagnoses from June 1, 2020 to December 31, 2021. Patients were followed through December 31, 2022 for new diabetes diagnoses. Sites examined incident diabetes by COVID-19 status using Cox proportional hazards models. Results were pooled in meta-analyses. A bias analysis examined potential impact of COVID-19 misclassification scenarios on results, guided by hypotheses that sensitivity would be <50% and would be higher among those who developed diabetes.
Results: Prevalence of documented COVID-19 was low overall and variable across sites (children: 4.4%-7.7%, young adults: 6.2%-22.7%). Individuals with documented COVID-19 were at higher risk of incident diabetes compared to those with no documented infection, but results were heterogeneous across sites. Findings were highly sensitive to COVID-19 misclassification assumptions. Observed results could be biased away from the null under several differential misclassification scenarios.
Discussion: Although EHR-based documentation of COVID-19 was associated with incident diabetes, COVID-19 phenotypes likely had low sensitivity, with considerable variation across sites. Misclassification assumptions strongly impacted interpretation of results.
Conclusion: Given the potential for low phenotype sensitivity and misclassification, caution is warranted when interpreting analyses of COVID-19 and incident diabetes using clinical or administrative databases.
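The quantitative bias analysis described above rests on standard misclassification-correction algebra. As a hedged sketch (the counts below are invented, and the formula assumes nondifferential misclassification with known sensitivity and specificity; the paper's actual scenarios also explored differential misclassification):

```python
def corrected_exposed(observed_pos: int, n: int, se: float, sp: float) -> float:
    """Back-correct an observed exposure count for misclassification,
    given assumed sensitivity (se) and specificity (sp).
    Derivation: observed_pos = se*A + (1-sp)*(n-A), solved for the
    true exposed count A."""
    return (observed_pos - (1 - sp) * n) / (se + sp - 1)

# Invented example: 70 documented COVID-19 cases among 1000 patients.
# If the EHR phenotype has se=0.40 (consistent with the paper's
# hypothesis of <50% sensitivity) and sp=0.999, the corrected count
# is far larger than the documented count.
print(round(corrected_exposed(70, 1000, 0.40, 0.999), 1))
```

With sensitivity below 50%, more than half of true infections would be missing from the "documented COVID-19" group, which is why the observed hazard ratios are so sensitive to these assumptions.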
{"title":"Multi-site analysis of COVID-19 and new-onset diabetes reveals need for improved sensitivity of EHR-based COVID-19 phenotypes-a DiCAYA Network analysis.","authors":"Sarah Conderino, H Lester Kirchner, Lorna E Thorpe, Jasmin Divers, Annemarie G Hirsch, Cara M Nordberg, Brian S Schwartz, Lu Zhang, Bo Cai, Caroline Rudisill, Jihad S Obeid, Angela Liese, Katie S Allen, Brian E Dixon, Tessa Crume, Dana Dabelea, Shawna Burgett, Anna Bellatorre, Hui Shao, Jiang Bian, Yi Guo, Sarah Bost, Tianchen Lyu, Kristi Reynolds, Matthew T Mefford, Hui Zhou, Matt Zhou, Eva Lustigova, Levon H Utidjian, Mitchell Maltenfort, Manmohan Kamboj, Eneida A Mendonca, Patrick Hanley, Ibrahim Zaganjor, Meda E Pavkov, Marc Rosenman, Andrea R Titus","doi":"10.1093/jamia/ocaf229","DOIUrl":"10.1093/jamia/ocaf229","url":null,"abstract":"<p><strong>Objective: </strong>We discuss implications of potential ascertainment biases for studies examining diabetes risk following SARS-CoV-2 infection using electronic health records (EHRs). We quantitatively explore sensitivity of results to misclassification of COVID-19 status using data from the U.S.-based Diabetes in Children, Adolescents and Young Adults (DiCAYA) Network on children (≤17 years) and young adults (18-44 years).</p><p><strong>Materials and methods: </strong>In our retrospective case study from the DiCAYA Network, SARS-CoV-2 was identified using labs and diagnoses from June 1, 2020 to December 31, 2021. Patients were followed through December 31, 2022 for new diabetes diagnoses. Sites examined incident diabetes by COVID-19 status using Cox proportional hazards models. Results were pooled in meta-analyses. 
A bias analysis examined potential impact of COVID-19 misclassification scenarios on results, guided by hypotheses that sensitivity would be <50% and would be higher among those who developed diabetes.</p><p><strong>Results: </strong>Prevalence of documented COVID-19 was low overall and variable across sites (children: 4.4%-7.7%, young adults: 6.2%-22.7%). Individuals with documented COVID-19 were at higher risk of incident diabetes compared to those with no documented infection, but results were heterogeneous across sites. Findings were highly sensitive to COVID-19 misclassification assumptions. Observed results could be biased away from the null under several differential misclassification scenarios.</p><p><strong>Discussion: </strong>Although EHR-based documentation of COVID-19 was associated with incident diabetes, COVID-19 phenotypes likely had low sensitivity, with considerable variation across sites. Misclassification assumptions strongly impacted interpretation of results.</p><p><strong>Conclusion: </strong>Given the potential for low phenotype sensitivity and misclassification, caution is warranted when interpreting analyses of COVID-19 and incident diabetes using clinical or administrative databases.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"710-718"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12884381/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145829065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michelle Gomez, Ellen W Clayton, Colin G Walsh, Kim M Unertl
Objectives: Trafficked persons experience adverse health consequences and seek help, but many go unrecognized by health-care professionals. This study explored professionals' perspectives on current approaches toward identifying and supporting trafficked persons in health-care settings, highlighting current technology roles, gaps, and future directions.
Materials and methods: We developed an interview guide to investigate current human trafficking (HT) approaches, safety procedures, and HT education. Semistructured interviews were conducted via Zoom, iteratively coded in Dedoose, and analyzed using a thematic analysis approach.
Results: We interviewed 19 health-care and community group professionals and identified 3 themes: (1) participants described a responsibility to build trust with patients through compassionate communication, rapport, and trauma-informed approaches across different stages of care. (2) Technology played a dual role, as professionals navigated both benefits and challenges of tools such as Zoom, virtual interpreters, and cameras in trust building. (3) Safety and privacy concerns guided how participants documented patient encounters and shared community resources, ensuring confidentiality while supporting patient and community well-being.
Discussion: Technology can both support and hinder trust in health care, directly affecting trafficked patients and their safety. Informatics can improve care for trafficked persons, but further research is needed on technology-based interventions. We provide recommendations to strengthen trust, enhance safety, support trauma-informed care, and promote safe documentation practices.
Conclusion: Effective sociotechnical approaches rely on trust, safety, and mindful documentation to support trafficked patients. Future research directions include refining the role of informatics in trauma-informed care to strengthen trust and mitigate unintended consequences.
{"title":"Identifying and supporting trafficked individuals: provider and community organization perspectives on existing sociotechnical approaches.","authors":"Michelle Gomez, Ellen W Clayton, Colin G Walsh, Kim M Unertl","doi":"10.1093/jamia/ocaf220","DOIUrl":"10.1093/jamia/ocaf220","url":null,"abstract":"<p><strong>Objectives: </strong>Trafficked persons experience adverse health consequences and seek help, but many go unrecognized by health-care professionals. This study explored professionals' perspectives on current approaches toward identifying and supporting trafficked persons in health-care settings, highlighting current technology roles, gaps, and future directions.</p><p><strong>Materials and methods: </strong>We developed an interview guide to investigate current human trafficking (HT) approaches, safety procedures, and HT education. Semistructured interviews were conducted via Zoom, iteratively coded in Dedoose, and analyzed using a thematic analysis approach.</p><p><strong>Results: </strong>We interviewed 19 health-care and community group professionals and identified 3 themes: (1) participants described a responsibility to build trust with patients through compassionate communication, rapport, and trauma-informed approaches across different stages of care. (2) Technology played a dual role, as professionals navigated both benefits and challenges of tools such as Zoom, virtual interpreters, and cameras in trust building. (3) Safety and privacy concerns guided how participants documented patient encounters and shared community resources, ensuring confidentiality while supporting patient and community well-being.</p><p><strong>Discussion: </strong>Technology can both support and hinder trust in health care, directly affecting trafficked patients and their safety. Informatics can improve care for trafficked persons, but further research is needed on technology-based interventions. 
We provide recommendations to strengthen trust, enhance safety, support trauma-informed care, and promote safe documentation practices.</p><p><strong>Conclusion: </strong>Effective sociotechnical approaches rely on trust, safety, and mindful documentation to support trafficked patients. Future research directions include refining the role of informatics in trauma-informed care to strengthen trust and mitigate unintended consequences.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"641-652"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981671/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145806132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved evaluation frameworks are required to move application of LLMs from research into clinical practice.","authors":"Suzanne Bakken","doi":"10.1093/jamia/ocag016","DOIUrl":"10.1093/jamia/ocag016","url":null,"abstract":"","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":"33 3","pages":"551-552"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981623/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147445873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David Chen, Patrick Li, Ealia Khoshkish, Seungmin Lee, Tony Ning, Umair Tahir, Henry C Y Wong, Michael S F Lee, Srinivas Raman
Objectives: To develop AutoReporter, a large language model (LLM) system that automates evaluation of adherence to research reporting guidelines.
Materials and methods: Eight prompt-engineering and retrieval strategies coupled with reasoning and general-purpose LLMs were benchmarked on the SPIRIT-CONSORT-TM corpus. The top-performing approach, AutoReporter, was validated on BenchReport, a novel benchmark dataset of expert-rated reporting guideline assessments from 10 systematic reviews.
Results: AutoReporter, a zero-shot, no-retrieval prompt coupled with the o3-mini reasoning LLM, demonstrated strong accuracy (CONSORT: 90.09%; SPIRIT: 92.07%) and substantial agreement with human raters (CONSORT: Cohen's κ = 0.70; SPIRIT: Cohen's κ = 0.77), with practical runtime (CONSORT: 617.26 s; SPIRIT: 544.51 s) and modest cost (CONSORT: 0.68 USD; SPIRIT: 0.65 USD). AutoReporter achieved a mean accuracy of 91.8% and substantial agreement (Cohen's κ > 0.6) with expert ratings on the BenchReport benchmark.
Discussion: Structured prompting alone can match or exceed fine-tuned domain models while forgoing manually annotated corpora and computationally intensive training.
Conclusion: Large language models can feasibly automate reporting guideline adherence assessments for scalable quality control in scientific research reporting. AutoReporter is publicly accessible at https://autoreporter.streamlit.app.
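The agreement statistic reported above, Cohen's κ, can be computed directly from two raters' label sequences. A minimal sketch over invented binary adherence judgments (not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    return (p_obs - p_exp) / (1 - p_exp)

# Invented example: model vs expert judgments of item adherence.
model  = ["yes", "yes", "no", "yes", "no", "no",  "yes", "no"]
expert = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
print(round(cohens_kappa(model, expert), 2))  # -> 0.5
```

Values above 0.6 are conventionally read as "substantial" agreement, which is the threshold the abstract invokes.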
{"title":"AutoReporter: development of an artificial intelligence tool for automated assessment of research reporting guideline adherence.","authors":"David Chen, Patrick Li, Ealia Khoshkish, Seungmin Lee, Tony Ning, Umair Tahir, Henry C Y Wong, Michael S F Lee, Srinivas Raman","doi":"10.1093/jamia/ocaf223","DOIUrl":"10.1093/jamia/ocaf223","url":null,"abstract":"<p><strong>Objectives: </strong>To develop AutoReporter, a large language model (LLM) system that automates evaluation of adherence to research reporting guidelines.</p><p><strong>Materials and methods: </strong>Eight prompt-engineering and retrieval strategies coupled with reasoning and general-purpose LLMs were benchmarked on the SPIRIT-CONSORT-TM corpus. The top-performing approach, AutoReporter, was validated on BenchReport, a novel benchmark dataset of expert-rated reporting guideline assessments from 10 systematic reviews.</p><p><strong>Results: </strong>AutoReporter, a zero-shot, no-retrieval prompt coupled with the o3-mini reasoning LLM, demonstrated strong accuracy (CONSORT 90.09%; SPIRIT: 92.07%), substantial agreement with humans (CONSORT Cohen's κ = 0.70, SPIRIT Cohen's κ = 0.77), runtime (CONSORT: 617.26 s; SPIRIT: 544.51 s), and cost (CONSORT: 0.68 USD; SPIRIT: 0.65 USD). AutoReporter achieved a mean accuracy of 91.8% and substantial agreement (Cohen's κ > 0.6) with expert ratings from the BenchReport benchmark.</p><p><strong>Discussion: </strong>Structured prompting alone can match or exceed fine-tuned domain models while forgoing manually annotated corpora and computationally intensive training.</p><p><strong>Conclusion: </strong>Large language models can feasibly automate reporting guideline adherence assessments for scalable quality control in scientific research reporting. 
AutoReporter is publicly accessible at https://autoreporter.streamlit.app.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"724-731"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145821799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samuel Dubin, Gabrielle Mayer, Nishant Pradhan, Madeline Xin, Richard Greene
Objectives: Documentation of gender identity (GI) and anatomy data in the electronic health record (EHR) is a proposed standard of care for transgender populations. However, there is limited research on implementation of proposed best practices, particularly anatomy data collection. This study aims to characterize factors that influence patient preferences and comfort around the collection and documentation of GI and anatomy in EHRs.
Materials and methods: From November 2023 to January 2024, 17 one-on-one, semi-structured virtual interviews were conducted with transgender adults residing in the Metropolitan New York area. Transcriptions were analyzed using inductive thematic analysis.
Results: Themes clustered around comfort and preferences for data collection processes and outcomes. Factors that influenced preferences and comfort around anatomy data were distinct from those impacting GI documentation preferences and comfort. The tension between the categories of GI and sex assigned at birth impacted anatomy data documentation preferences. Clinical context emerged as a consistent factor that impacts both preferences and comfort of GI and anatomy data documentation.
Discussion and conclusion: GI data collection efforts in clinical settings must consider the implication of anatomy data collection when determining data collection best practice methodologies. Anticipated and experienced stigma remain significant hurdles to patient comfort and willingness to collect GI and anatomy data, and their impact on actual data collection should be further elucidated among diverse gender identities. Clinical data collection methods, tools, and education warrant ongoing research investment to further elucidate best practices.
{"title":"Patient perspectives on gender identity and anatomy data collection in electronic health records: a qualitative study.","authors":"Samuel Dubin, Gabrielle Mayer, Nishant Pradhan, Madeline Xin, Richard Greene","doi":"10.1093/jamia/ocaf205","DOIUrl":"10.1093/jamia/ocaf205","url":null,"abstract":"<p><strong>Objectives: </strong>Documentation of gender identity (GI) and anatomy data in the electronic health record (EHR) is a proposed standard of care for transgender populations. However, there is limited research on implementation of proposed best practices, particularly anatomy data collection. This study aims to characterize factors that influence patient preferences and comfort around the collection and documentation of GI and anatomy in EHRs.</p><p><strong>Materials and methods: </strong>From November 2023 to January 2024, 17 one-on-one, semi-structured virtual interviews were conducted with transgender adults residing in the Metropolitan New York area. Transcriptions were analyzed using inductive thematic analysis.</p><p><strong>Results: </strong>Themes clustered around comfort and preferences for data collection processes and outcomes. Factors that influenced preferences and comfort around anatomy data were distinct from those impacting GI documentation preferences and comfort. The tension between the categories of GI and sex assigned at birth impacted anatomy data documentation preferences. Clinical context emerged as a consistent factor that impacts both preferences and comfort of GI and anatomy data documentation.</p><p><strong>Discussion and conclusion: </strong>GI data collection efforts in clinical settings must consider the implication of anatomy data collection when determining data collection best practice methodologies. 
Anticipated and experienced stigma remain significant hurdles to patient comfort and willingness to collect GI and anatomy data, and their impact on actual data collection should be further elucidated among diverse gender identities. Clinical data collection methods, tools, and education warrant ongoing research investment to further elucidate best practices.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"587-592"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981651/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145726643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin D Wissel, Zana Percy, Tanner J Zachem, Brett Beaulieu-Jones, Isaac S Kohane, Stuart L Goldstein, Emrah Gecili, Judith W Dexheimer
Objective: To understand the heterogeneous treatment effects of electronic alerts for acute kidney injury (AKI).
Materials and methods: Secondary analysis of individual patient data from 3 randomized controlled trials. Our outcome measure was 14-day all-cause mortality. Data from the ELAIA-1 trial were used to predict the individualized effect of alerts on mortality based on patients' phenotype. Results were internally validated on a holdout dataset and externally validated using data from 2 additional trials: UPenn and ELAIA-2. We used machine learning-based methods and performed a meta-analysis on individual patient data to identify patient subgroups whose risk of mortality was associated with alerts. In addition, provider actions following alerts were examined to explain how alerts impacted patient mortality.
Results: Compared to patients who were predicted to be harmed by an alert, patients predicted to benefit had a lower risk of death in both the internal validation cohort (n = 1809 patients; P for interaction = .045) and both external validation cohorts (n = 7453 patients; P for interaction < .0001). In external cohorts, 43 deaths may have been preventable if alerts had been restricted to likely beneficiaries. Machine learning-based meta-analysis identified reduced mortality with alerts among patients with higher blood pressures and lower predicted risk, but increased mortality in non-urban and non-teaching hospitals. Provider responses to alerts differed across subgroups.
Discussion: Our findings indicate substantial heterogeneity in the effects of AKI alerts on patient mortality. Tailoring alert delivery based on predicted benefit may mitigate harm and enhance clinical outcomes.
Conclusion: Individualizing automated alerts may reduce all-cause mortality. A prospective trial of individualized alerts is needed to confirm these results.
Trial registration: https://clinicaltrials.gov/ct2/show/NCT02753751 and https://clinicaltrials.gov/ct2/show/NCT02771977.
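The subgroup-level heterogeneity described above can be illustrated with a much simpler calculation than the authors' machine-learning pipeline: a stratified risk difference (alerted minus control mortality) per subgroup. The data, the subgroup labels, and the function below are invented for illustration:

```python
def subgroup_risk_difference(records, subgroup_key):
    """Per-subgroup mortality risk difference, alerted minus control.
    records: dicts with boolean 'alerted' and 'died' fields plus an
    arbitrary subgroup field named by subgroup_key."""
    groups = {}
    for r in records:
        g = groups.setdefault(r[subgroup_key],
                              {"a_n": 0, "a_d": 0, "c_n": 0, "c_d": 0})
        if r["alerted"]:
            g["a_n"] += 1; g["a_d"] += r["died"]
        else:
            g["c_n"] += 1; g["c_d"] += r["died"]
    return {k: g["a_d"] / g["a_n"] - g["c_d"] / g["c_n"]
            for k, g in groups.items()}

# Invented toy cohort: alerts appear helpful at one site type and
# harmful at the other, mirroring the kind of heterogeneity reported.
records = (
    [{"site": "teaching", "alerted": True,  "died": d} for d in [0, 0, 0, 1]] +
    [{"site": "teaching", "alerted": False, "died": d} for d in [0, 1, 1, 0]] +
    [{"site": "rural",    "alerted": True,  "died": d} for d in [1, 1, 0, 0]] +
    [{"site": "rural",    "alerted": False, "died": d} for d in [0, 1, 0, 0]]
)
print(subgroup_risk_difference(records, "site"))
```

A negative risk difference in a subgroup suggests benefit from alerting there; the study's approach goes further by predicting individualized effects from patient phenotype and validating them on held-out trials.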
{"title":"Heterogenous effect of automated alerts on mortality.","authors":"Benjamin D Wissel, Zana Percy, Tanner J Zachem, Brett Beaulieu-Jones, Isaac S Kohane, Stuart L Goldstein, Emrah Gecili, Judith W Dexheimer","doi":"10.1093/jamia/ocaf222","DOIUrl":"10.1093/jamia/ocaf222","url":null,"abstract":"<p><strong>Objective: </strong>To understand the heterogeneous treatment effects of electronic alerts for acute kidney injury (AKI).</p><p><strong>Materials and methods: </strong>Secondary analysis of individual patient data from 3 randomized controlled trials. Our outcome measure was 14-day all-cause mortality. Data from the ELAIA-1 trial were used to predict the individualized effect of alerts on mortality based on patients' phenotype. Results were internally validated on a holdout dataset and externally validated using data from 2 additional trials: UPenn and ELAIA-2. We used machine learning-based methods and performed a meta-analysis on individual patient data to identify patient subgroups whose risk of mortality was associated with alerts. In addition, provider actions following alerts were examined to explain how alerts impacted patient mortality.</p><p><strong>Results: </strong>Compared to patients who were predicted to be harmed by an alert, patients predicted to benefit had a lower risk of death in both the internal validation cohort (n = 1809 patients; Pinteraction = .045) and both external validation cohorts (n = 7453 patients; Pinteraction < .0001). In external cohorts, 43 deaths may have been preventable if alerts were restricted to likely beneficiaries. Machine-learning based meta-analysis identified reduced mortality with alerts among patients with higher blood pressures (BP) and lower predicted risk, but increased mortality in non-urban and non-teaching hospitals. Provider responses to alerts differed across subgroups.</p><p><strong>Discussion: </strong>Our findings indicate substantial heterogeneity in the effects of AKI alerts on patient mortality. 
Tailoring alert delivery based on predicted benefit may mitigate harm and enhance clinical outcomes.</p><p><strong>Conclusion: </strong>Individualizing automated alerts may reduce all-cause mortality. A prospective trial of individualized alerts is needed to confirm these results.</p><p><strong>Trial registration: </strong>https://clinicaltrials.gov/ct2/show/NCT02753751 and https://clinicaltrials.gov/ct2/show/NCT02771977.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"653-662"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981663/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145829031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu
Objectives: To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).
Materials and methods: We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources: UT Physicians (UTP; 1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples; 146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2; 50). The corpus captures 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). LLaMA-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.
Results: LLaMA models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, these performance gains came with increased computational costs: LLaMA models required more memory and Graphics Processing Unit (GPU) hours and ran up to 28 times slower than BERT.
Discussion: While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.
Conclusion: Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.
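The F1 comparisons above are entity-level scores of the kind typically used for clinical NER benchmarks. A minimal strict-match scorer can be sketched as follows; the tuple representation and function name are illustrative assumptions, not taken from the paper or from the Kiwi package:

```python
# Entity-level strict-match F1 for NER, a minimal sketch.
# Entities are (start, end, label) tuples; "strict" means a prediction
# counts as a true positive only when span boundaries AND label match.

def ner_f1(gold, pred):
    """Return (precision, recall, f1) for strict entity matching."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact span+label matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one "problem" entity matched, one "medication" missed,
# one spurious "test" predicted.
gold = [(0, 12, "problem"), (20, 29, "medication")]
pred = [(0, 12, "problem"), (35, 40, "test")]
p, r, f = ner_f1(gold, pred)  # 0.5, 0.5, 0.5
```

Reported gains "over 7% for NER" refer to absolute differences in this kind of F1 score between model families on a held-out dataset.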
Title: "Information extraction from clinical notes: are we ready to switch to large language models?" doi: 10.1093/jamia/ocaf213. Journal of the American Medical Informatics Association, pages 553-562, published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981642/pdf/
Title: "Optimizing example-selection in retrieval-augmented biomedical in-context learning: reflections on the MMRAG study." Weihao Cheng. doi: 10.1093/jamia/ocaf236. Journal of the American Medical Informatics Association, pages 779-780, published 2026-03-01. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981649/pdf/
Andrew R Weckstein, Shirley V Wang, Richard Wyss, Sebastian Schneeweiss
Objectives: Real-world evidence (RWE) increasingly informs clinical decisions, yet manual adjustment for confounding limits scalability. Data-adaptive (DA) algorithms for high-dimensional proxy adjustment show promise but have not been systematically compared to investigator-specified (IS) approaches across diverse treatment scenarios. We evaluated whether DA strategies perform comparably to manually curated IS models using claims-based emulations of 15 randomized trials from the RCT-DUPLICATE initiative.
Materials and methods: We identified new-user cohorts for 15 trial emulations in Optum's de-identified Clinformatics Data Mart Database (2004-2023). Treatment effects were estimated using 3 adjustment strategies: (1) IS models with manually tailored covariates; (2) full-DA strategies using empirical features from semiautomated pipelines; and (3) hybrid-DA models incorporating both empirical and investigator-defined covariates. Agreement with RCT benchmarks was assessed via binary metrics and difference-in-differences.
Results: Outcome-adaptive LASSO achieved better RWE-RCT agreement than IS adjustment in 73% of full-DA and 87% of hybrid-DA emulations. Other DA methods considering feature associations with both treatment and outcome performed similarly well, while models tuned solely for treatment prediction performed poorly. Performance of IS vs DA strategies differed across emulated trials.
Discussion: Top DA algorithms matched manual IS models on average, but impact varied by emulation. Case studies illustrate the continued importance of subject-matter knowledge, particularly for complex treatment strategies.
Conclusion: Data-adaptive algorithms show promise for scalable confounding adjustment in large-scale evidence systems and as augmentation tools for investigator-specified designs. Hybrid strategies combining algorithmic methods with investigator expertise offer the most reliable approach for individual causal questions.
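The agreement assessment described above (binary metrics and difference-in-differences against RCT benchmarks) can be illustrated with a small sketch. The metric definitions here are assumptions for illustration, not the exact RCT-DUPLICATE criteria: "binary agreement" is taken as same effect direction plus same statistical significance, and the difference-in-differences as the gap between log-scale effect estimates:

```python
# Agreement between an RWE emulation estimate and its RCT benchmark,
# a minimal sketch. Hazard ratios with 95% CIs given as (lo, hi) tuples.
import math

def agreement(rwe_hr, rwe_ci, rct_hr, rct_ci):
    """Return (binary_agreement, log-scale difference-in-differences)."""
    def significant(ci):
        lo, hi = ci
        return not (lo <= 1.0 <= hi)  # CI excludes the null (HR = 1)

    same_direction = (rwe_hr < 1.0) == (rct_hr < 1.0)
    same_significance = significant(rwe_ci) == significant(rct_ci)
    binary_agree = same_direction and same_significance
    did = math.log(rwe_hr) - math.log(rct_hr)  # difference-in-differences
    return binary_agree, did

# Toy example: emulation HR 0.82 (0.70-0.96) vs trial HR 0.79 (0.66-0.94)
agree, did = agreement(0.82, (0.70, 0.96), 0.79, (0.66, 0.94))
```

Under these definitions, an emulation "agrees" with its benchmark trial when both show a protective, statistically significant effect, and the difference-in-differences quantifies how far apart the two estimates sit on the log scale.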
Title: "Scalable confounding adjustment in real-world evidence: benchmarking data-adaptive and investigator-specified strategies in a large-scale trial emulation study." doi: 10.1093/jamia/ocaf204. Journal of the American Medical Informatics Association, pages 573-586, published 2026-03-01.