Background: Cancer progression is an important outcome in cancer research. However, it is frequently documented in electronic health records (EHRs) only as unstructured text, so extracting it for retrospective studies requires lengthy and costly chart reviews.
Objective: This study aimed to evaluate the performance of 3 deep learning language models in determining breast and colorectal cancer progression in EHRs.
Methods: EHRs for individuals diagnosed with stage 4 breast or colorectal cancer between 2004 and 2020 in Manitoba, Canada, were extracted. A chart review was conducted to identify cancer progression in each EHR. Data were analyzed with pretrained deep learning language models (Bio+ClinicalBERT, Clinical-BigBird, and Clinical-Longformer). Sensitivity, positive predictive value, area under the curve, and scaled Brier scores were used to evaluate performance. Influential tokens were identified by removing and adding tokens to EHRs and examining changes in predicted probabilities.
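The abstract does not state the exact formulation of the scaled Brier score; below is a minimal sketch assuming the common definition of 1 minus the ratio of the model's Brier score to that of a non-informative reference model that predicts the outcome prevalence for every chart. Labels and probabilities are illustrative, not study data.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def scaled_brier(y_true, y_prob):
    """Scaled Brier score: 1 - Brier / Brier_max, where Brier_max comes from a
    reference model that predicts the event prevalence for every record."""
    brier = brier_score_loss(y_true, y_prob)
    prevalence = np.mean(y_true)
    brier_max = prevalence * (1 - prevalence)  # Brier score of the prevalence-only model
    return 1 - brier / brier_max

# Hypothetical example: 1 = progression documented in the EHR, 0 = no progression.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.3, 0.6, 0.4])
print(scaled_brier(y_true, y_prob))  # 0.7 for this toy example
```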
Results: Clinical-BigBird and Clinical-Longformer models for breast and colorectal cancer cohorts demonstrated higher accuracy than the Bio+ClinicalBERT models (scaled Brier scores for breast cancer models: 0.70-0.79 vs 0.49-0.71; scaled Brier scores for colorectal cancer models: 0.61-0.65 vs 0.49-0.61). The same models also demonstrated higher sensitivity (breast cancer models: 86.6%-94.3% vs 76.6%-87.1%; colorectal cancer models: 73.1%-78.9% vs 62.8%-77.0%) and positive predictive value (breast cancer models: 77.9%-92.3% vs 80.6%-85.5%; colorectal cancer models: 81.6%-86.3% vs 72.9%-82.9%) compared to the Bio+ClinicalBERT models. All models could remove more than 84% of charts from the chart review process. The most influential token was the word "progression," whose influence depended on the presence of other tokens and on its position within an EHR.
Conclusions: The deep learning language models could help identify breast and colorectal cancer progression in EHRs and remove most charts from the chart review process. A limited number of tokens may influence model predictions. Improvements in model performance could be obtained by increasing the training dataset size and analyzing EHRs at the sentence level rather than at the EHR level.
Background: Recent advances in large language models (LLMs), such as GPT-4o, offer a transformative opportunity to extract nuanced linguistic, emotional, and social features from campaign texts at scale. These models enable a deeper understanding of the factors influencing campaign success, far beyond what structured data alone can reveal. Given these advancements, there is a pressing need for an integrated modeling framework that leverages both LLM-derived features and machine learning algorithms to more accurately predict and explain success in medical crowdfunding.
Objective: This study addresses that gap by leveraging cutting-edge machine learning techniques alongside state-of-the-art large language models such as GPT-4o to automatically generate and extract nuanced linguistic, social, and clinical features from campaign narratives. By combining these features with ensemble learning approaches, the proposed methodology offers a novel and more comprehensive strategy for understanding and predicting crowdfunding success in the medical domain.
Methods: We used GPT-4o to extract linguistic and social determinants of health (SDOH) features from cancer crowdfunding campaign narratives. A Random Forest model with permutation importance was applied to rank features based on their contribution to predicting campaign success. Four machine learning algorithms (Random Forest, Gradient Boosting, Logistic Regression, and Elastic Net) were evaluated using stratified 10-fold cross-validation, with performance measured by accuracy, sensitivity, and specificity.
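A minimal sketch of the downstream modeling step, not the authors' pipeline: once the GPT-4o-derived linguistic and SDOH features are assembled into a feature matrix, ranking them with Random Forest permutation importance and evaluating a classifier with stratified 10-fold cross-validation could look like the following. Feature names, labels, and settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold, cross_validate

# Placeholder feature matrix; in practice each column is a GPT-4o-extracted feature.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 4)),
                 columns=["medical_severity", "income_loss", "empathy", "clarity"])
y = rng.integers(0, 2, size=200)  # 1 = campaign reached its funding goal (placeholder)

clf = RandomForestClassifier(n_estimators=300, random_state=0)

# Stratified 10-fold cross-validation; sensitivity is scored via recall,
# specificity would need a custom scorer and is omitted here.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "recall"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})

# Permutation importance ranks features by the drop in score when each is shuffled.
clf.fit(X, y)
imp = permutation_importance(clf, X, y, n_repeats=20, random_state=0)
print(sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1]))
```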
Results: Gradient Boosting consistently outperformed the other algorithms in terms of sensitivity (0.786-0.798), indicating its superior ability to identify successful crowdfunding campaigns using linguistic and social determinants of health features. The permutation importance scores revealed that severe medical conditions, income loss, chemotherapy treatment, clear and effective communication, cognitive understanding, family involvement, empathy, and social behaviors play an important role in campaign success.
Conclusions: This study demonstrates that large language models like GPT-4o can effectively extract nuanced linguistic and social features from crowdfunding narratives, offering deeper insights than traditional methods. These features, when combined with machine learning, significantly improve the identification of key predictors of campaign success, such as medical severity, financial hardship, and empathetic communication. Our findings underscore the potential of LLMs to enhance predictive modeling in health-related crowdfunding and support more targeted policy and communication strategies to reduce financial vulnerability among cancer patients.
Background: As physicians spend up to twice as much time on electronic health record tasks as on direct patient care, digital scribes have emerged as a promising solution to restore patient-clinician communication and reduce documentation burden, making it essential to study their real-world impact on clinical workflows, efficiency, and satisfaction.
Objective: This study aimed to synthesize evidence on clinician efficiency, user satisfaction, quality, and practical barriers associated with the use of digital scribes using ambient listening and generative artificial intelligence (AI) in real-world clinical settings.
Methods: A rapid review was conducted to evaluate the real-world evidence of digital scribes using ambient listening and generative AI in clinical practice from 2014 to 2024. Data were collected from Ovid MEDLINE, Embase, Web of Science-Core Collection, Cochrane CENTRAL and Reviews, and PubMed Central. Predefined eligibility criteria focused on studies addressing clinical implementation, excluding those centered solely on technical development or model validation. The findings of each study were synthesized and analyzed through the QUEST human evaluation framework for quality and safety and the Systems Engineering Initiative for Patient Safety (SEIPS) 3.0 model to assess integration into clinicians' workflows and experience.
Results: Of the 1450 studies identified, 6 met the inclusion criteria. These studies included an observational study, a case report, a peer-matched cohort study, and survey-based assessments conducted across academic health systems, community settings, and outpatient practices. The major themes noted were as follows: (1) digital scribes decreased self-reported documentation times, although note length increased; (2) physician burnout measured using standardized scales was unaffected, but physician engagement improved; (3) physician productivity, assessed via billing metrics, was unchanged; and (4) the studies fell short when assessed against standardized evaluation frameworks.
Conclusions: Digital scribes show promise in reducing documentation burden and enhancing clinician satisfaction, thereby supporting workflow efficiency. However, the currently available evidence is sparse. Future real-world, multifaceted studies are needed before AI scribes can be recommended unequivocally.
Background: Early diagnosis of diabetes is essential for early interventions to slow the progression of dysglycemia and its comorbidities. However, about 23% of individuals with diabetes are unaware of their condition.
Objective: This study aims to investigate the potential use of automated machine learning (AutoML) models and self-reported data in detecting undiagnosed diabetes among US adults.
Methods: Individual-level data, including biochemical tests for diabetes, demographic characteristics, family history of diabetes, anthropometric measures, dietary intakes, health behaviors, and chronic conditions, were retrieved from the National Health and Nutrition Examination Survey, 1999-2020. Undiagnosed diabetes was defined as having no prior self-reported diagnosis but meeting diagnostic criteria for elevated hemoglobin A1c, fasting plasma glucose, or 2-hour plasma glucose. The H2O AutoML framework, which allows for automated hyperparameter tuning, model selection, and ensemble learning, was used to automate the machine learning workflow. For comparative analysis, 4 traditional machine learning models (logistic regression, support vector machines, random forest, and extreme gradient boosting) were implemented. Model performance was evaluated using the area under the receiver operating characteristic curve.
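A minimal sketch of the H2O AutoML workflow named above, assuming the NHANES-derived analytic table has been exported to a CSV with a binary outcome column; the file name, column name, and run settings are illustrative, not the authors' configuration.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("nhanes_features.csv")  # hypothetical export of the analytic dataset
frame["undiagnosed_dm"] = frame["undiagnosed_dm"].asfactor()  # binary outcome as a factor

train, test = frame.split_frame(ratios=[0.8], seed=42)
predictors = [c for c in frame.columns if c != "undiagnosed_dm"]

# AutoML automates hyperparameter tuning, model selection, and stacked ensembles.
aml = H2OAutoML(max_models=20, seed=42, sort_metric="AUC")
aml.train(x=predictors, y="undiagnosed_dm", training_frame=train)

print(aml.leaderboard.head(rows=5))              # ranked candidate models
print(aml.leader.model_performance(test).auc())  # AUC on the held-out split
```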
Results: The study included 11,815 participants aged 20 years and older, comprising 2256 patients with undiagnosed diabetes and 9559 without diabetes. The average age was 59.76 (SD 15.0) years for participants with undiagnosed diabetes and 46.78 (SD 17.2) years for those without diabetes. The AutoML model demonstrated superior performance compared with the 4 traditional machine learning models. The trained AutoML model achieved an area under the receiver operating characteristic curve of 0.909 (95% CI 0.897-0.921) in the test set. The model demonstrated a sensitivity of 70.26%, specificity of 90.46%, positive predictive value of 64.10%, and negative predictive value of 92.61% for identifying undiagnosed diabetes from nondiabetes.
Conclusions: To our knowledge, this study is the first to utilize the AutoML model for detecting undiagnosed diabetes in US adults. The model's strong performance and applicability to the broader US population make it a promising tool for large-scale diabetes screening efforts.
Background: Multidisciplinary care management teams must rapidly prioritize interventions for patients with complex medical and social needs. Current approaches rely on individual training, judgment, and experience, missing opportunities to learn from longitudinal trajectories and prevent adverse outcomes through recommender systems.
Objective: This study aims to evaluate whether a reinforcement learning approach could outperform standard care management practices in recommending optimal interventions for patients with complex needs.
Methods: Using data from 3175 Medicaid beneficiaries in care management programs across 2 states from 2023 to 2024, we compared alternative approaches for recommending "next best step" interventions: the standard experience-based approach (status quo) and a state-action-reward-state-action (SARSA) reinforcement learning model. We evaluated performance using clinical impact metrics, conducted counterfactual causal inference analyses to estimate reductions in acute care events, assessed fairness across demographic subgroups, and performed qualitative chart reviews where the models differed.
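A minimal sketch of the tabular SARSA update underlying the reinforcement learning approach, with states, actions, and rewards as illustrative stand-ins for care-management states, "next best step" interventions, and outcome-based rewards; the learning rate and discount factor are assumed values, not the study's.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (assumed)
Q = defaultdict(float)   # Q[(state, action)] -> estimated long-run value

def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * Q[(next_state, next_action)]
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

# Hypothetical logged transition: a member with respiratory issues and poor housing
# receives a housing referral, no acute care event follows (reward +1), and the next
# observed state-action pair comes from the same trajectory.
sarsa_update(state="resp_issue_poor_housing", action="housing_referral",
             reward=1.0, next_state="resp_issue_housing_stable",
             next_action="pulmonary_follow_up")
print(Q[("resp_issue_poor_housing", "housing_referral")])
```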
Results: In counterfactual analyses, SARSA-guided care management reduced acute care events by 12 percentage points (95% CI 2.2-21.8 percentage points, a 20.7% relative reduction; P=.02) compared to the status quo approach, with a number needed to treat of 8.3 (95% CI 4.6-45.2) to prevent 1 acute event. The approach showed improved fairness across demographic groups, including gender (3.8% vs 5.3% disparity in acute event rates, reduction 1.5%, 95% CI 0.3%-2.7%) and race and ethnicity (5.6% vs 8.9% disparity, reduction 3.3%, 95% CI 1.1%-5.5%). In qualitative reviews, the SARSA model detected and recommended interventions for specific medical-social interactions, such as respiratory issues associated with poor housing quality or food insecurity in individuals with diabetes.
Conclusions: SARSA-guided care management shows potential to reduce acute care use compared to standard practice. The approach demonstrates how reinforcement learning can improve complex decision-making in situations where patients face concurrent clinical and social factors while maintaining safety and fairness across demographic subgroups.
Background: Australians can face significant challenges in navigating the health care system, especially in rural and regional areas. Generative search tools, powered by large language models (LLMs), show promise in improving health information retrieval by generating direct answers. However, concerns remain regarding their accuracy and reliability when compared to traditional search engines in a health care context.
Objective: This study aimed to compare the effectiveness of a generative artificial intelligence (AI) search (ie, Microsoft Copilot) versus a conventional search engine (Google Web Search) for navigating health care information.
Methods: A total of 97 adults in Queensland, Australia, participated in a web-based survey, answering scenario-based health care navigation questions using either Microsoft Copilot or Google Web Search. Accuracy was assessed using binary correct or incorrect ratings, graded correctness (incorrect, partially correct, or correct), and numerical scores (0-2 for service identification and 0-6 for criteria). Participants also completed a Technology Rating Questionnaire (TRQ) to evaluate their experience with their assigned tool.
Results: Participants assigned to Microsoft Copilot outperformed the Google Web Search group on 2 health care navigation tasks (identifying aged care application services and listing mobility allowance eligibility criteria), with no clear evidence of a difference in the remaining 6 tasks. On the TRQ, participants rated Google Web Search higher in willingness to adopt and perceived impact on quality of life, and lower in effort needed to learn. Both tools received similar ratings in perceived value, confidence, help required to use, and concerns about privacy.
Conclusions: Generative AI tools can achieve comparable accuracy to traditional search engines for health care navigation tasks, though this did not translate into an improved user experience. Further evaluation is needed as AI technology improves and users become more familiar with its use.
Background: Electronic patient records are a valuable yet underused data source; they have been explored in research using natural language processing, but not yet within a third-sector organization.
Objective: This study aimed to apply natural language processing to develop a risk identification tool capable of discerning high and low suicide risk among veterans, using electronic patient records from a United Kingdom-based veteran mental health charity.
Methods: A total of 20,342 notes were extracted for this purpose. To develop the risk tool, 70% of the records formed the training dataset, while the remaining 30% were allocated for testing and evaluation. The classification framework was devised and trained to categorize risk as a binary outcome: 1 indicating high risk and 0 indicating low risk.
Results: The efficacy of each classifier model was assessed by comparing its results with those from clinical risk assessments. A logistic regression classifier was found to perform best and was used to develop the final model. This comparison allowed for the calculation of the positive predictive value (mean 0.74, SD 0.059; 95% CI 0.70-0.77), negative predictive value (mean 0.73, SD 0.024; 95% CI 0.72-0.75), sensitivity (mean 0.75, SD 0.017; 95% CI 0.74-0.76), F1-score (mean 0.74, SD 0.033; 95% CI 0.72-0.76), and the Youden index (mean 0.73, SD 0.035; 95% CI 0.71-0.76).
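A minimal sketch, not the charity's actual tool: a TF-IDF plus logistic regression text classifier with a 70/30 train/test split mirrors the binary high/low-risk framing described above. The notes, labels, and preprocessing choices are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder clinical notes and binary risk labels (1 = high risk, 0 = low risk).
notes = ["expressed hopelessness and a recent plan", "attended appointment, mood stable"] * 50
labels = [1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.3, stratify=labels, random_state=0)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```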
Conclusions: The risk identification tool successfully determined the correct risk category of veterans from a large sample of clinical notes. Future studies should investigate whether this tool can detect more nuanced differences in risk and be generalizable across data sources.

