[This corrects the article DOI: 10.1093/jamiaopen/ooz061.].
Objective: To develop and evaluate an automated classification system for labeling Exposure Process Coding System (EPCS) quality codes, specifically exposure and encourage events, during in-person exposure therapy sessions using automatic speech recognition (ASR) and natural language processing techniques.
Materials and methods: The system was trained and tested on 360 manually labeled pediatric Obsessive-Compulsive Disorder (OCD) therapy sessions from 3 clinical trials. Audio recordings were transcribed using ASR tools (OpenAI's Whisper and Google Speech-to-Text). Transcription accuracy was evaluated via word error rate (WER), comparing ASR-generated transcripts against manual transcriptions of 2-minute audio segments. The resulting text was analyzed with transformer-based models, including Bidirectional Encoder Representations from Transformers (BERT), Sentence-BERT, and Meta Llama 3. Models were trained to predict EPCS codes in 2 classification settings: sequence-level classification, where events are labeled in delimited text chunks, and token-level classification, where event boundaries are unknown. Classification was performed either with fine-tuned transformer-based models or with logistic regression on embeddings produced by each model.
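A minimal sketch of the transcription-accuracy check described above, assuming the openai-whisper and jiwer packages; the file paths, Whisper model size, and normalization step are illustrative assumptions rather than details reported in the abstract:

```python
import re

import jiwer    # standard word error rate implementation
import whisper  # openai-whisper

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so WER reflects word choices only."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())

# Transcribe a 2-minute audio segment with Whisper (model size is hypothetical).
model = whisper.load_model("base")
hypothesis = model.transcribe("segment_2min.wav")["text"]

# Compare against the manual reference transcript for the same segment.
with open("segment_2min_manual.txt") as f:
    reference = f.read()

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.2f}")  # the study reports 0.31 for Whisper vs 0.51 for Google Speech-to-Text
```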
Results: With respect to transcription accuracy, Whisper outperformed Google Speech-to-Text with a lower WER (0.31 vs 0.51). In the sequence classification setting, Llama 3 models achieved high performance, with area under the ROC curve (AUC) scores of 0.95 for exposures and 0.75 for encourage events, outperforming traditional methods and standard BERT models. In the token-level setting, fine-tuned BERT models performed best, achieving AUC scores of 0.85 for exposures and 0.75 for encourage events.
Discussion and conclusion: Current ASR and transformer-based models enable automated quality coding of in-person exposure therapy sessions. These findings demonstrate potential for real-time assessment in clinical practice and for scalable research on effective therapy methods. Future work should focus on optimization, including improving ASR accuracy, expanding training datasets, and integrating multimodal data.
Objectives: The objective of this study was to develop and test natural language processing (NLP) methods for screening and, ultimately, predicting the cancer relevance of peer-reviewed publications.
Materials and methods: Two datasets were used: (1) manually curated publications labeled for cancer relevance, co-authored by members of The University of Kansas Cancer Center (KUCC) and (2) a derived dataset containing cancer-related abstracts from American Association for Cancer Research journals and noncancer-related abstracts from other medical journals. Two text encoding methods were explored: term frequency-inverse document frequency (TF-IDF) vectorization and various BERT embeddings. These representations served as inputs to 3 supervised machine learning classifiers: Support Vector Classification (SVC), Gradient Boosting Classification, and Multilayer Perceptron (MLP) neural networks. Model performance was evaluated by comparing predictions to the "true" cancer-relevant labels in a withheld test set.
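A compact sketch of the TF-IDF arm of this comparison, using scikit-learn's SVC, gradient boosting, and MLP classifiers scored with F1 on a held-out split; the data loader, split proportions, and hyperparameters are placeholders, not the study's settings:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical loader returning abstract texts and 0/1 cancer-relevance labels.
texts, labels = load_labeled_abstracts()

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# TF-IDF vectorization fitted on the training split only.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

for name, clf in [("SVC", SVC()),
                  ("GradientBoosting", GradientBoostingClassifier()),
                  ("MLP", MLPClassifier(max_iter=500))]:
    clf.fit(X_train_vec, y_train)
    print(name, "F1 =", round(f1_score(y_test, clf.predict(X_test_vec)), 3))
```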
Results: All machine learning models performed best when trained and tested within the derived dataset. Across the datasets, SVC and MLP both exhibited strong performance, with F1 scores as high as 0.976 and 0.997, respectively. BioBERT embeddings resulted in slightly higher metrics when compared to TF-IDF vectorization across most models.
Discussion: Models trained on the derived data performed very well internally; however, weaker performance was noted when these models were tested on the KUCC dataset. This finding highlights the subjective nature of cancer-relevance determinations. In contrast, KUCC-trained models had high predictive performance when tested on the derived dataset's classifications, showing that models trained on the KUCC dataset may be suitable for wider cancer-relevance prediction.
Conclusions: Overall, our results suggest that NLP can effectively automate the classification of cancer-relevant publications, enhancing research productivity tracking; however, great care should be taken in selecting the appropriate data, text representation approach, and machine learning approach.
Objective: Neonatal jaundice monitoring is resource-intensive. Existing artificial intelligence methods use image or clinical data, but none systematically combine both or compare feature contributions. This study fills that gap by extracting and analyzing multimodal features on a large dataset, identifying an optimal feature set for accurate, accessible jaundice assessment.
Materials and methods: This study collected clinical data and skin images from 3 body regions of 633 neonates, generating 460 features across 4 categories. Four tree-based models were used to predict total serum bilirubin levels, and feature importance analysis guided the selection of an optimal feature set.
Results: The optimal performance was achieved using the Light Gradient Boosting Machine (LGBM) model with 140 selected features, yielding a root mean square error (RMSE) of 2.0477 mg/dL and a Pearson correlation of 0.8435. This represents a performance gain of over 10% in RMSE compared to models using only a single data modality. Moreover, selecting the top 30 features based on SHapley Additive exPlanation (SHAP) allows for a substantial reduction in data dimensionality, while maintaining performance within 5% of the optimal model.
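A sketch of the modeling and SHAP-based selection steps under stated assumptions: a LightGBM regressor predicting total serum bilirubin from a feature table, with features ranked by mean absolute SHAP value; the data loader, split, and hyperparameters are illustrative:

```python
import numpy as np
import lightgbm as lgb
import shap
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame of the 460 multimodal features and the serum bilirubin target.
features, tsb = load_jaundice_features()

X_train, X_test, y_train, y_test = train_test_split(features, tsb, test_size=0.2, random_state=0)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)  # illustrative hyperparameters
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r, _ = pearsonr(y_test, pred)
print(f"RMSE = {rmse:.4f} mg/dL, Pearson r = {r:.4f}")

# Rank features by mean |SHAP| and keep the 30 most influential ones.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
top30 = features.columns[np.argsort(mean_abs_shap)[::-1][:30]]
```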
Discussion: Color features contributed over 60% of the total importance, with clinical data adding more than 25%, led by hour of life. Light temperature also affected predictions, while texture features had minimal impact. Among body regions, the abdomen provided the most informative signals for jaundice severity.
Conclusion: The proposed algorithm shows promise for real-world use by enabling timely, automated jaundice assessment for families, while also offering insights for future research and broader medical applications.
Objectives: Prediction models using statistical or machine learning (ML) approaches can enhance clinical decision support tools. Infliximab (IFX), a biologic with a newly introduced biosimilar for Crohn's disease (CD) and ulcerative colitis (UC), presents an opportunity to evaluate these tools at the time of biosimilar switch to predict disease flares. This study sought to evaluate the real-world safety and effectiveness of a nonmedical IFX biosimilar switch in a national US cohort of CD and UC patients, and to develop and compare interpretable models for predicting adverse clinical events among patients on maintenance IFX.
Materials and methods: This retrospective cohort study used administrative and clinical data from the National Veterans Health Administration Corporate Data Warehouse. It included 2529 Veterans with CD or UC on maintenance IFX (2017-2020), either continuing originator IFX or switching to a biosimilar. The primary outcome was disease-related flare. Classification and survival models were developed using traditional and ML methods and assessed via receiver operating characteristic curve, precision-recall curve, and decision curve analysis.
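An illustrative sketch of the classification-model assessment, with a plain random forest standing in for the study's models (the paper's RF+ and survival methods are not reproduced here) and scored via ROC and precision-recall curves; the feature matrix and flare labels are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Hypothetical covariate matrix and binary flare outcome per patient.
X, y_flare = load_ifx_cohort()

X_train, X_test, y_train, y_test = train_test_split(
    X, y_flare, test_size=0.25, stratify=y_flare, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

print("ROC AUC:", roc_auc_score(y_test, scores))
print("Average precision:", average_precision_score(y_test, scores))
fpr, tpr, _ = roc_curve(y_test, scores)                        # ROC curve points
precision, recall, _ = precision_recall_curve(y_test, scores)  # PR curve points
```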
Results: In 2529 Veterans with CD or UC, biosimilar switch had low predictive importance across survival models. Objective laboratory-related information yielded the highest validation. Random forest+ (RF+) outperformed all other statistical and ML models. Prior flares and total health-care encounters were the 2 most important predictors, while hemoglobin was the top laboratory predictor.
Conclusions: Prediction models, particularly RF+, may aid in optimizing biologic therapy for CD and UC by identifying patients at higher risk of flare following a biosimilar switch.
Background: Bias evaluations of machine learning (ML) models often focus on performance in research settings, with limited assessment of downstream bias following clinical deployment. The objective of this study was to evaluate whether CHARTwatch, a real-time ML early warning system for inpatient deterioration, demonstrated algorithmic bias in model performance or produced disparities in care processes and outcomes across patient sociodemographic groups.
Methods: We evaluated CHARTwatch implementation on the internal medicine service at a large academic hospital. Patient outcomes during the intervention period (November 1, 2020-June 1, 2022) were compared to the control period (November 1, 2016-December 31, 2019) using propensity score overlap weighting. We evaluated differences across key sociodemographic subgroups, including age, sex, homelessness, and neighborhood-level socioeconomic and racialized composition. Outcomes included model performance (sensitivity and specificity), processes of care, and patient outcomes (non-palliative in-hospital death).
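A minimal sketch of propensity score overlap weighting as named in the Methods, assuming a logistic regression propensity model over baseline covariates; the column names and model choice are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def overlap_weights(df: pd.DataFrame, covariates: list[str], period_col: str) -> pd.Series:
    """Overlap weights: intervention patients get 1 - e(x), controls get e(x)."""
    ps_model = LogisticRegression(max_iter=1000)
    ps_model.fit(df[covariates], df[period_col])          # period_col: 1 = intervention, 0 = control
    e = ps_model.predict_proba(df[covariates])[:, 1]      # estimated propensity score e(x)
    return pd.Series(np.where(df[period_col] == 1, 1 - e, e), index=df.index)

# Hypothetical usage: the weights then feed weighted comparisons of outcomes
# between the intervention and control periods.
# cohort["w"] = overlap_weights(cohort, ["age", "sex", "homelessness"], "intervention")
```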
Results: Among 12 877 patients (9079 control, 3798 intervention), 13.3% were experiencing homelessness and 36.9% lived in the quintile with the highest neighborhood racialized and newcomer populations. Model sensitivity was 70.1% overall, with no significant variation across subgroups. Model specificity varied by age (<60 years: 93% [95% confidence interval (CI), 91-95%]; 60-80 years: 90% [95% CI, 87-92%]; >80 years: 84% [95% CI, 79-88%]; P < .001) but not by other subgroups. CHARTwatch implementation was associated with an increase in code status documentation among patients experiencing homelessness, without significant differences in other care processes or outcomes.
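For the subgroup performance estimates quoted above, a hedged sketch of how specificity with a 95% CI could be computed per age band, assuming a Wilson interval via statsmodels (the authors' CI method is not stated); the counts are hypothetical:

```python
from statsmodels.stats.proportion import proportion_confint

def specificity_ci(true_negatives: int, negatives: int):
    """Specificity and Wilson 95% CI for one subgroup."""
    spec = true_negatives / negatives
    lo, hi = proportion_confint(true_negatives, negatives, alpha=0.05, method="wilson")
    return spec, (lo, hi)

# Hypothetical counts for one age band; the study's raw counts are not reported here.
spec, (lo, hi) = specificity_ci(true_negatives=930, negatives=1000)
print(f"Specificity {spec:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```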
Conclusion: CHARTwatch model performance and impact were generally consistent across measured sociodemographic subgroups. ML-based clinical decision support tools, and associated standardization of care, may reduce existing inequities, as was observed for code status orders among patients experiencing homelessness. This evaluation provides a framework for future bias assessments of deployed ML-CDS tools.
Objectives: The NIH's Bridge2AI Program has funded 4 "new flagship biomedical and behavioral datasets that are properly documented and ready for use with AI [artificial intelligence] or ML [machine learning] technologies" to promote the adoption of AI. This article discusses the challenges and lessons learned in data collection and governance to ensure their responsible use.
Materials and methods: We outline major steps involved in creating and using these datasets in ethically acceptable ways, including (1) data selection-what data are being selected and why, (2) increasing attention to public concerns, (3) the role of participant consent depending on data source, (4) ensuring responsible use, (5) where and how data are stored, (6) what control participants have over data sharing, (7) data access, and (8) data download.
Results: We discuss ethical, legal, social, and practical challenges raised at each step of creating AI-ready datasets, noting the importance of addressing issues of future data storage and use. We identify some of the many choices that these projects have made, including how to incorporate public input, where to store data, and defining criteria for access to and downloading data.
Discussion: The processes involved in the establishment and governance of the Bridge2AI datasets vary widely but have common elements, suggesting opportunities for future programs to build on Bridge2AI strategies.
Conclusions: This article discusses the challenges and lessons learned in data collection and governance to ensure their responsible use, particularly as confronted by the 4 distinct projects funded by this program.
Objective: Identifying social determinants of mental health (SDOMH) in patients with opioid use disorder (OUD) is crucial for estimating risk and enabling early intervention. Extracting such data from unstructured clinical notes is challenging due to annotation complexity and requires advanced natural language processing (NLP) techniques. We propose the Human-in-the-Loop Large Language Model Interaction for Annotation (HLLIA) framework, combined with a Multilevel Hierarchical Clinical-Longformer Embedding (MHCLE) algorithm, to annotate and predict SDOMH variables.
Materials and methods: We utilized 2636 annotated discharge summaries from the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. High-quality annotations were ensured via a human-in-the-loop approach, refined using large language models (LLMs). The MHCLE algorithm performed multi-label classification of 13 SDOMH variables and was evaluated against baseline models, including RoBERTa, Bio_ClinicalBERT, ClinicalBERT, and ClinicalBigBird.
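A baseline-style sketch of the multi-label setup described above, using a Hugging Face sequence-classification head with 13 sigmoid outputs; the Clinical-Longformer checkpoint name is an assumption, and the paper's multilevel hierarchical embedding (MHCLE) step is not reproduced here:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "yikuan8/Clinical-Longformer"  # assumed checkpoint; swap for the model actually used
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=13,                                # one output per SDOMH variable
    problem_type="multi_label_classification",    # trains with BCE-with-logits loss
)

note = "Discharge summary text ..."               # placeholder input
inputs = tokenizer(note, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)                     # independent per-label probabilities
predicted_labels = (probs > 0.5).int()            # multi-label prediction at a 0.5 threshold
```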
Results: The MHCLE model achieved superior performance, with 96.29% accuracy and a 95.41% F1 score, surpassing baseline models. Training-testing policies P1, P2, and P3 yielded accuracies of 98.49%, 90.10%, and 89.04%, respectively, highlighting the importance of human intervention in refining LLM annotations.
Discussion and conclusion: Integrating the MHCLE model with the HLLIA framework offers an effective approach for predicting SDOMH factors from clinical notes, advancing NLP in OUD care. It highlights the importance of human oversight and sets a benchmark for future research.
Objective: This study aims to develop and validate a span-based annotation framework for clinical named entity recognition (NER) using large language models (LLMs), based on Korean emergency department clinical notes.
Materials and methods: Two datasets with the same entity types but different annotation spans (word- vs phrase-level) were constructed, with the phrase-level dataset further expanded into a doubled version. A Korean language-specific LLM was fine-tuned on each dataset, producing three variants that were compared with two baseline models: a few-shot LLM and a fine-tuned small language model (SLM). The final variant, fine-tuned on the doubled phrase-level dataset, was further evaluated against a human annotator.
Results: In all experimental settings, the three variants outperformed the baselines, achieving the highest F1 scores across all metrics. The final variant achieved F1 scores exceeding 0.80 across all averaging strategies and evaluation metrics, including token-based, span-based exact, and span-based partial evaluations, demonstrating robustness suitable for practical settings.
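As a hedged illustration of the evaluation styles listed above, seqeval yields span-based exact-match F1 from BIO tags, while a flat token-level F1 scores each tag independently; the entity types and tag sequences below are toy examples, and the partial-match metric would require additional logic not shown:

```python
from seqeval.metrics import f1_score as span_f1
from sklearn.metrics import f1_score as token_f1

# Toy BIO-tagged sequences with hypothetical entity types.
y_true = [["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG"]]
y_pred = [["B-SYMPTOM", "O",         "O", "B-DRUG"]]

print("Span-based exact F1:", span_f1(y_true, y_pred))              # full-entity matches only
print("Token-level micro F1:",
      token_f1(sum(y_true, []), sum(y_pred, []), average="micro"))  # per-token tag matches
```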
Discussion: While few-shot prompt engineering is widely adopted for LLM-based clinical NER, our results showed that supervised fine-tuning (SFT) is consistently superior. The final variant outperformed the human annotator, emphasizing its potential as an automatic labeling tool.
Conclusion: This study introduced a novel span-based annotation framework for LLM-based clinical NER, verified by three independent experiments. In multilingual and real-world clinical settings, LLMs have proven capable of handling complex entity spans that include word-level and phrase-level annotations, particularly for long and attribute-rich entities.
Objectives: This study describes the utilization and experiences of artificial intelligence (AI)-generated draft responses to patient messages among pediatric ambulatory clinicians and contextualizes their experiences in relation to those of adult specialty clinicians.
Materials and methods: A prospective pilot was conducted from September 2023 to August 2024 in 2 pediatric clinics (General Pediatric and Adolescent Medicine) and 2 obstetric clinics (Reproductive Endocrinology and Infertility and General Obstetrics) within an academic health system in Northern California. Participants included physician, nurse, and medical assistant volunteers. The intervention involved a feature utilizing large language models embedded in the electronic health record to generate draft responses. The proportion of AI-generated drafts used was collected, as were prepilot and follow-up surveys.
Results: A total of 61 clinicians (26 pediatric, 35 obstetric) enrolled, with 46 (75%) completing both surveys. Pediatric clinicians utilized 13.3% (95% CI, 12.3%-14.4%) of AI-generated drafts, and usage rates when responding to patients vs their proxies were similar (15% vs 12.9%, P = .24). Despite using AI-generated drafts significantly less than obstetric clinicians (18.3% [17.2%-19.5%], P < .0001), pediatric clinicians reported a significant reduction in perceived task load (NASA Task Load Index score decreased from 59.9 to 50.9, P = .04) and were more likely to recommend the tool (LTR: 7.0 vs 5.2, P = .04).
Discussion and conclusion: Pediatric clinicians used AI-generated drafts at a rate within previously reported ranges for adult specialties and found the tool useful. These findings suggest this tool has potential for enhancing efficiency and reducing task load in pediatric care.

