Objectives: Elevated blood pressure (BP) and hypertension are often overlooked in pediatric care. We adapted a pediatric hypertension clinical decision support (CDS) tool for a primarily rural health system and compared its impact across varied implementation approaches.
Methods: In this cluster randomized trial, 40 primary care clinics were randomized 1:1:1 to CDS with high-intensity implementation, CDS with low-intensity implementation, or usual care (UC). Low-intensity implementation was limited to online training. High-intensity implementation included in-person and online training, monthly check-ins, and feedback regarding CDS use. Patients aged 6-17 years with BP measured at a primary care visit from August 1, 2022 to January 31, 2024 were eligible. Outcomes were remeasurement of elevated BP during a visit and recognition of hypertension within 6 months of meeting criteria. Analyses adjusted for the clustered study design and patient characteristics.
Results: Of 9155 patients with an elevated BP, remeasurement during the visit occurred for 51.5% at high-intensity, 23.6% at low-intensity, and 6.2% at UC clinics. Among 578 patients with incident hypertension, recognition occurred for 42.8% at high-intensity, 24.5% at low-intensity, and 14.4% at UC clinics. Patients attending high- or low-intensity CDS clinics were more likely than those at UC clinics to have elevated BP remeasured (adjusted odds ratio [aOR] 8.70; 95% CI, 5.68-13.3) and to have their hypertension clinically recognized (aOR 2.94; 95% CI, 1.00-8.60). High-intensity implementation was more effective than low-intensity implementation for repeat BP measurement (aOR 3.45; 95% CI, 1.88-6.33) and hypertension recognition (aOR 2.31; 95% CI, 1.08-4.98).
Conclusions: CDS improved pediatric BP care in a primarily rural health system, although effectiveness varied by implementation approach.
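To make the adjustment for clustering concrete, here is a minimal sketch of the kind of model the Methods describe, using generalized estimating equations with clinics as clusters; the input file and the column names (remeasured, arm, age, sex, clinic_id) are hypothetical, and the study's actual specification and covariates may differ.

```python
# Sketch only: cluster-adjusted logistic regression via GEE (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical input: one row per patient with an elevated BP at a visit,
# with columns remeasured (0/1), arm ('UC'/'low'/'high'), age, sex, clinic_id.
df = pd.read_csv("bp_visits.csv")

model = smf.gee(
    "remeasured ~ C(arm, Treatment(reference='UC')) + age + C(sex)",
    groups="clinic_id",                       # account for clustering by clinic
    data=df,
    family=sm.families.Binomial(),            # logistic link for a binary outcome
    cov_struct=sm.cov_struct.Exchangeable(),  # within-clinic correlation
)
result = model.fit()

print(np.exp(result.params))      # adjusted odds ratios
print(np.exp(result.conf_int()))  # 95% confidence intervals
```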
Objective: Predictive models of suicide risk have focused on features extracted from structured data in electronic health records, with limited consideration of predisposing life events (LE), such as housing instability and marital troubles, that are expressed in unstructured clinical text. This study expands upon previous research by demonstrating how high-performance computing (HPC) and machine learning (ML) methodologies can be used to extract and annotate 8 LE across all Veterans Health Administration (VHA) unstructured clinical text data with improved performance metrics. Integration of the 8 LE with structured features using different statistical and ML methods is also discussed.
Materials and methods: VHA-wide clinical text from January 2000 to January 2022 was pre-processed and analyzed using HPC. Data-driven lexicon curation enabled a rule-based annotator to extract LE, followed by machine learning to improve positive predictive value (PPV). Natural language processing (NLP) results were analyzed longitudinally and then integrated into, and compared against, a baseline statistical model predicting risk for a combined outcome (suicide death, suicide attempt, and overdose).
Results: First-time LE mentions showed a significant temporal correlation with suicide-related events (SRE; suicide ideation, attempt, and/or death) and were not associated with administrative bias. Predictive linear regression (LR) models integrating NLP-derived LE showed an improved area under the curve (AUC) of 0.81 and identified up to 18% novel patients.
Discussion: Our analysis shows that these methodologies significantly improved performance metrics relative to our previous work while outperforming related approaches. These results demonstrate that NLP-derived LE serve as acute predictors of SRE.
Conclusion: NLP integration into predictive models may help improve clinician decision support. Future work is necessary to better define and integrate these and other potential LE.
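As an illustration of the rule-based step in the pipeline above, the sketch below applies a small regex lexicon to flag candidate LE mentions in a note; the lexicon entries and category names are invented for this example, not the VHA-curated ones, and the downstream ML filter that improved PPV is omitted.

```python
# Sketch only: lexicon-driven rule-based annotator for life-event mentions.
import re

LE_LEXICON = {
    "housing_instability": [r"\bhomeless(ness)?\b", r"\bevict(ed|ion)\b",
                            r"\bunstable housing\b"],
    "marital_trouble": [r"\bdivorce(d)?\b", r"\bmarital (conflict|problems)\b"],
}

PATTERNS = {label: re.compile("|".join(rules), re.IGNORECASE)
            for label, rules in LE_LEXICON.items()}

def annotate(note: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for candidate LE mentions."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(note):
            hits.append((label, match.group(0)))
    return hits

print(annotate("Veteran reports recent eviction and ongoing marital conflict."))
# [('housing_instability', 'eviction'), ('marital_trouble', 'marital conflict')]
```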
Purpose: Foundation models pretrained on structured electronic health record (EHR) data promise improved predictive performance, sample efficiency, and resilience to distribution shifts. However, how these models should be designed, scaled, and used remains unclear. Our objectives were to characterize foundation models pretrained on structured EHR data; examine temporal trends in model application, scale, architecture, and design; and assess the extent to which publications omitted methodological details.
Methods: We searched MEDLINE and Embase (2018-October 2025) for foundation models pretrained on structured EHR data using self-supervised learning and applied to clinical prediction tasks. Study selection and data abstraction were performed in duplicate. Characteristics were summarized and stratified by median publication year.
Results: Fifty-three studies were included; publications increased over time. Most datasets (79%) originated from the United States. None pretrained exclusively on pediatric cohorts. Model architecture shifted toward transformers (P = .013) with longer context windows (P = .028), while application shifted from exclusively embedding-based toward generative or mixed use (P < .001). Choices regarding feature inclusion, temporal representation, self-supervised objective, and downstream adaptation remained heterogeneous. Only 26% of studies evaluated transfer to external datasets, and none described clinical deployment. Key indicators of scale and compute were frequently unreported.
Conclusions: EHR foundation models are proliferating and are increasingly transformer-based and generative. Yet methodological choices and reporting remain fragmented, indicating that design trade-offs and best practices for EHR foundation models have not yet been established; notably, no study described clinical deployment. Future work should clarify which design choices improve performance, robustness, and transferability; increase reporting transparency; and determine whether these models can be implemented to improve patient-important outcomes.
Objective: This study examined the use of machine learning (ML) and domain-specific enrichment of patient-generated health data, in the form of free-text meal logs, to classify meals by their alignment with different nutritional goals.
Materials and methods: We used a dataset of over 3000 meal records collected by 114 individuals from a diverse, low-income community in a major US city using a mobile app. Registered dietitians (RDs) provided expert judgment on meal-goal alignment, which served as the gold standard for evaluation. Using text embeddings (TF-IDF and BERT) and domain-specific enrichment information (ontologies, ingredient parsers, and macronutrient contents) as inputs, we evaluated the performance of logistic regression and multilayer perceptron classifiers using accuracy, precision, recall, and F1 score against the gold standard and individuals' self-assessments.
Results: On average, individuals' self-assessments of meal-goal alignment achieved an accuracy of 0.576. Even without enrichment, ML outperformed individuals' self-assessments, with accuracies of 0.726-0.841 across goals. The best-performing combination of ML classifier and enrichment achieved even higher accuracies (0.814-0.902). In general, ML classifiers enriched with parsed ingredients, food entities, and macronutrient information performed well across multiple nutritional goals, although the impact of enrichment and classification algorithm on accuracy varied by goal.
Conclusion: ML can utilize unstructured free-text meal logs and reliably classify whether meals align with specific nutritional goals, exceeding individuals' self-assessments, especially when incorporating nutrition domain knowledge. Our findings highlight the potential of ML analysis of patient-generated health data to support patient-centered nutrition guidance in precision healthcare.
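For readers who want a concrete starting point, the following is a hedged sketch of the simplest configuration evaluated above: TF-IDF features from free-text meal logs feeding a logistic regression classifier scored against dietitian labels. The file and column names are assumptions, and the domain-specific enrichment step is not shown.

```python
# Sketch only: TF-IDF + logistic regression for meal-goal alignment.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical input: free-text meal logs with binary RD alignment labels.
df = pd.read_csv("meal_logs.csv")  # columns: meal_text, rd_label

X_train, X_test, y_train, y_test = train_test_split(
    df["meal_text"], df["rd_label"],
    test_size=0.2, random_state=42, stratify=df["rd_label"],
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # unigrams and bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 against the RD gold standard.
print(classification_report(y_test, clf.predict(X_test)))
```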
Objectives: Patients often receive health care from multiple organizations. Privacy Preserving Record Linkage (PPRL) is a technology for linking patient records without releasing personally identifiable information. We compared a commercial PPRL tool that uses the XGBoost machine learning algorithm with Care Everywhere (CE), a widely used rule-based patient linkage module.
Materials and methods: We matched the complete patient populations from Cedars-Sinai Health System and University of California, Los Angeles (UCLA) Health using the XGBoost PPRL tool at each of 3 score thresholds (98, 95, and 90), reflecting stricter vs more permissive matching. We compared PPRL matches with CE matches for the cohort of 849 157 patients who had been queried by CE from UCLA to Cedars-Sinai over 18 months. To classify proposed matches as false, uncertain, or correct, 2 reviewers manually reviewed a random sample of 1200 patients spanning each category of matches.
Results: Care Everywhere matched 18% of the cohort, whereas PPRL matched 9%, 27%, and 29% of the cohort using the 98, 95, and 90 thresholds, respectively. Projecting the false match rates from the manual review to the original populations, precision for CE was 99.6% (95% CI, 97.8%-100%). Precision for PPRL was 100% (95% CI, 99.2%-100%), 99.4% (95% CI, 97.4%-99.9%), and 98.7% (95% CI, 96.5%-99.4%) at the 3 thresholds, respectively. Using CE and PPRL matches together as a proxy gold standard, recall for CE was 61.5% (95% CI, 60.3%-61.9%) and for PPRL was 30.6% (95% CI, 30.3%-30.7%), 92.2% (95% CI, 90.2%-92.7%), and 96.8% (95% CI, 94.6%-97.5%) at each threshold, respectively.
Conclusions: The precision and recall of PPRL matching differed substantially across the available match thresholds. Compared with the rule-based system, PPRL at the 95 threshold had 50% higher recall with similar precision. Privacy Preserving Record Linkage holds promise for improving research, but users must choose the balance of precision vs recall needed for their application.
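The threshold trade-off reported above can be illustrated with a short sketch: given proposed links with match scores and adjudicated true-match labels (stand-ins for the manual review and the proxy gold standard), compute precision and recall at each cutoff. The file and column names are hypothetical.

```python
# Sketch only: precision/recall of proposed links at each PPRL score cutoff.
import pandas as pd

links = pd.read_csv("proposed_links.csv")  # columns: score, is_true_match (0/1)
total_true = links["is_true_match"].sum()  # denominator for recall

for threshold in (98, 95, 90):
    accepted = links[links["score"] >= threshold]
    tp = accepted["is_true_match"].sum()
    precision = tp / len(accepted)  # stricter cutoffs raise precision
    recall = tp / total_true        # looser cutoffs raise recall
    print(f"threshold {threshold}: precision={precision:.3f}, recall={recall:.3f}")
```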
Objectives: We aimed to identify and map recent studies using high-frequency, time-series electronic patient-generated health data (ePGHD) to assess treatment response; characterize ePGHD types and collection methods; summarize ePGHD-based definitions of treatment response; and describe analytical approaches used.
Materials and methods: We systematically searched 4 databases for articles published between January 2022 and June 2024, supplemented by a forward citation search through June 2025. Peer-reviewed studies were eligible if ePGHD were collected outside clinical settings and were either reported at least weekly (if actively reported by participants) or summarized at discrete intervals (eg, daily) if passively collected via wearables or sensors. We screened articles for eligibility independently in duplicate and synthesized extracted data descriptively.
Results: Our search yielded 4030 articles, of which we included 186. Most studies collected ePGHD using mobile applications or webforms (n = 133) over 4-12 weeks (n = 67). Prior to analysis, 132 studies excluded portions of the ePGHD or condensed them into one or more summaries. Among 172 studies estimating treatment response, 98 applied longitudinal methods (eg, mixed-effects models) that accounted for repeated measures while capturing within- and between-subject variation, whereas 74 used cross-sectional approaches. Of 18 prediction modeling studies, 16 employed machine learning techniques, with only 4 explicitly modeling repeated measures. Five studies identified clusters of response trajectories, generally without incorporating temporal dependencies (eg, using K-means).
Discussion and conclusion: Many studies in this review did not fully leverage the high-frequency, longitudinal nature of ePGHD. Future research should adopt more appropriate and readily available analytic methods to maximize the potential of time-series ePGHD for generating insights into treatment response.
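As a concrete example of the longitudinal methods the review recommends, here is a minimal mixed-effects sketch with a random intercept and slope per participant, which uses every repeated measurement rather than a single summary; the variable names and data layout are assumed.

```python
# Sketch only: linear mixed-effects model for time-series ePGHD (statsmodels).
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per participant-week, with columns
# symptom_score, week, treatment, participant_id.
df = pd.read_csv("epghd_long.csv")

model = smf.mixedlm(
    "symptom_score ~ week * treatment",  # treatment-by-time interaction
    data=df,
    groups="participant_id",             # repeated measures within participants
    re_formula="~week",                  # random intercept and slope over time
)
print(model.fit().summary())
```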
Objective: Traditional electronic health record (EHR) foundation models fail to process unseen medical codes, limiting generalizability across institutions with different vocabularies. To address this problem, we introduce medical concept representation (MedRep), a set of standardized medical concept representations for EHR foundation models that enables recognition of semantically similar concepts regardless of their specific identifiers.
Materials and methods: We utilized the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) vocabulary, covering 7.5 million concepts from 66 medical vocabularies. MedRep integrates large language model-generated concept descriptions and the OMOP graph ontology using graph contrastive learning with knowledge distillation. We evaluated MedRep-based models on MIMIC-IV (internal validation) and EHRSHOT (external validation) across 9 prediction tasks covering clinical outcomes, phenotypes, and in-hospital events.
Results: MedRep consistently outperformed baseline models, particularly in external validation with average improvements of 0.088 in area under the receiver operating characteristic curve and 0.208 in area under the precision-recall curve. Qualitative analysis demonstrated that MedRep-based models identified more clinically relevant concepts when making decisions than the baseline models. Performance improvements remained stable across diverse EHR foundation model architectures, including BEHRT, Med-BERT, and CDM-BERT.
Discussion: MedRep improves the generalizability of EHR foundation models by encouraging similar concepts to have similar representations. EHR foundation models developed at different institutions could cooperate through MedRep, merging knowledge from multiple hospital datasets. In addition, our approach could reduce healthcare disparities by enabling smaller institutions to benefit from models trained on larger datasets.
Conclusion: MedRep improves EHR foundation model performance, interpretability, and generalizability, serving as a standard baseline representation for EHR foundation models adopting OMOP CDM.
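To give a flavor of the contrastive objective described in the Methods, the sketch below implements a generic symmetric InfoNCE loss that pulls a concept's two views (an LLM-derived description embedding and a graph-derived embedding) together while pushing apart other concepts in the batch; this is a standard formulation, not the exact MedRep loss, and the encoder names are placeholders.

```python
# Sketch only: symmetric InfoNCE contrastive loss between two embedding views.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, graph_emb: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """text_emb, graph_emb: (batch, dim) embeddings of the same concepts."""
    z1 = F.normalize(text_emb, dim=1)
    z2 = F.normalize(graph_emb, dim=1)
    logits = z1 @ z2.T / tau            # scaled cosine similarities
    targets = torch.arange(z1.size(0))  # matched pairs lie on the diagonal
    # Symmetric loss: text -> graph and graph -> text directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# e.g., loss = info_nce(description_encoder(batch), graph_encoder(batch))
```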
Objective: Automated literature screening in biomedical research is often hindered by domain shifts and scarcity of labeled data, which limit model accuracy and generalizability. While large language models (LLMs) perform well in zero-shot settings, they often fail to capture complex, domain-specific reasoning patterns. To address this limitation, this study investigates whether an interactive, weakly supervised learning framework that combines the fine-tuning adaptability of GPT (generative pre-trained transformer) models with DeepSeek's reasoning capabilities can improve literature screening performance across biomedical domains.
Materials and methods: We developed an active learning framework that leverages model disagreement between GPT-4o and DeepSeek to improve literature screening performance. This process began with a labeled corpus of 6331 articles on large language models, on which a model disagreement analysis was performed to identify cases that GPT-4o misclassified but DeepSeek predicted correctly. Three GPT variants (GPT-4o, GPT-4o-mini, and GPT-4.1-nano) were fine-tuned under standard supervised learning settings using these disagreement-based samples. Fine-tuning prompts incorporated classification labels and, when available, rationale traces generated by DeepSeek to provide reasoning-augmented weak supervision. Model performance was evaluated on an independent benchmark set of 291 annotated articles across 10 topic queries in cancer immunotherapy and LLMs in medicine, using standard evaluation metrics, with recall as the primary measure.
Results: Fine-tuning GPT models on disagreement-based examples significantly improved performance. GPT-4o-mini achieved the best overall results after fine-tuning, with the highest F1 score (0.93, P < .001) and recall (0.95, P < .001). Across the biomedical topics, fine-tuned models consistently outperformed their zero-shot counterparts without increasing reviewer workload.
Discussion: These findings demonstrate the effectiveness of disagreement-driven active learning in enhancing GPT-based biomedical literature screening. Lightweight models like GPT-4o-mini benefit most from targeted, reasoning-enriched training, highlighting their suitability for scalable deployment.
Conclusion: This study introduces an interactive active learning framework that leverages fine-tuned LLMs with reasoning capabilities to enhance literature screening. The approach offers a scalable path to more efficient and reliable information retrieval in systematic reviews.
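The disagreement-sampling step above can be sketched in a few lines: keep the articles GPT-4o got wrong but DeepSeek got right, and write them out in OpenAI's chat-format fine-tuning JSONL, appending DeepSeek's rationale when available. The corpus file and its column names are assumptions about how such data might be stored.

```python
# Sketch only: build reasoning-augmented fine-tuning data from disagreements.
import json
import pandas as pd

df = pd.read_csv("screening_corpus.csv")
# Assumed columns: abstract, gold_label, gpt4o_pred, deepseek_pred,
# deepseek_rationale (may be missing).

mask = (df["gpt4o_pred"] != df["gold_label"]) & (df["deepseek_pred"] == df["gold_label"])

with open("finetune.jsonl", "w") as f:
    for _, row in df[mask].iterrows():
        answer = str(row["gold_label"])
        if isinstance(row["deepseek_rationale"], str):  # reasoning-augmented supervision
            answer = f"{row['deepseek_rationale']}\n\nLabel: {row['gold_label']}"
        record = {"messages": [
            {"role": "system",
             "content": "Decide whether this abstract is relevant to the review topic."},
            {"role": "user", "content": row["abstract"]},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")
```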
Objectives: Skin cancer is the most common malignancy in the United States, with more than 5 million cases diagnosed annually among 3.3 million individuals. Melanoma, the deadliest form of skin cancer, accounts for roughly 200 000 new diagnoses each year and nearly 10 000 deaths. Artificial intelligence (AI)-based skin cancer detection is being developed and tested in laboratory and academic settings as a promising approach to improve access and reduce disparities. However, current models often underperform on darker skin tones (Fitzpatrick types V and VI), creating fairness concerns that must be addressed prior to clinical deployment. Existing fairness-aware methods focus on algorithmic adjustments while neglecting data quality and representation. We introduce FAIR-SCAN (Fairness and Accuracy through Ranking-Based Subset Selection for Skin Cancer Detection), a data-centric framework that enhances fairness through subset selection guided by marginal contribution score (MCS) estimation.
Materials and methods: FAIR-SCAN ranks data points by their contribution to both accuracy and fairness, then selects an optimal subset for training. We evaluated its effectiveness using images from Diverse Dermatology Images (DDI) and Fitzpatrick 17K.
Results: FAIR-SCAN improved balance in accuracy, true positive rate, and false positive rate across skin tones while reducing the training dataset by 50%, outperforming algorithm-focused fairness methods.
Discussion: These findings highlight the importance of strategic data selection in mitigating bias in AI-driven diagnostics. FAIR-SCAN's data-centric approach enhances both precision and equity in skin cancer detection.
Conclusion: Strategic data selection is critical for equitable AI-driven diagnostics. FAIR-SCAN advances fairness and accuracy in skin cancer detection, supporting development of trustworthy clinical AI systems.
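To make the data-centric idea concrete, here is a high-level sketch of ranking-based subset selection in the spirit of FAIR-SCAN: score each training example by its estimated marginal contribution to accuracy and to fairness, combine the two, and keep the top half. The scoring inputs here are placeholders; the paper's MCS estimator will differ.

```python
# Sketch only: keep the training examples with the highest combined
# marginal contribution to accuracy and fairness.
import numpy as np

def select_subset(acc_scores: np.ndarray, fair_scores: np.ndarray,
                  alpha: float = 0.5, keep_frac: float = 0.5) -> np.ndarray:
    """Return indices of examples to keep for training.

    acc_scores / fair_scores: per-example estimated contributions to accuracy
    and to fairness (eg, narrowing TPR/FPR gaps across skin tones).
    """
    mcs = alpha * acc_scores + (1 - alpha) * fair_scores  # combined score
    n_keep = int(len(mcs) * keep_frac)                    # eg, 50% of the data
    return np.argsort(mcs)[::-1][:n_keep]                 # highest-ranked first

rng = np.random.default_rng(0)
keep_idx = select_subset(rng.normal(size=1000), rng.normal(size=1000))
print(len(keep_idx))  # 500
```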

