Turning Dialogues Into Event Data: Lessons From GPT-Based Recognition of Nursing Actions
Pub Date: 2025-11-14 | DOI: 10.1016/j.jbi.2025.104957
Iris Beerepoot , Sjaak Brinkkemper , Elke Huntink , Berfin Duman , Hajo A. Reijers , Nienke Bleijenberg
Objective:
To assess the feasibility of using a large language model (LLM) to generate structured event logs from conversational data in home-based nursing care, with the goal of reducing the documentation burden and enabling process analysis.
Methods:
We conducted an exploratory study involving 27 audio-recorded home care visits between district nurses and patients. These recordings were transcribed and used as input for a Generative Pre-Trained Transformer (GPT) to identify nursing interventions and construct event logs, using the standardised Nursing Interventions Classification (NIC) system. We applied and evaluated different prompts through an iterative, interdisciplinary process involving computer scientists and nurse researchers.
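As an illustration of the kind of pipeline the Methods describe, the sketch below prompts an OpenAI chat model to label interventions in a transcript and emits event-log rows. The prompt wording, model name, and the three NIC labels are illustrative assumptions, not the study's actual prompts or label set.

```python
# Minimal sketch: ask an LLM to turn a visit transcript into event-log rows.
# Prompt text, model name, and the NIC label subset are illustrative only.
import json
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NIC_LABELS = ["Medication Management", "Wound Care", "Emotional Support"]  # tiny illustrative subset

def extract_events(transcript: str, case_id: str) -> list[dict]:
    prompt = (
        "You are given a transcript of a home-care nursing visit.\n"
        f"Label each nursing intervention with one of: {', '.join(NIC_LABELS)}.\n"
        "Return only a JSON list of objects with keys: activity, start, end, evidence.\n\n"
        f"Transcript:\n{transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    events = json.loads(resp.choices[0].message.content)  # assumes the model returned valid JSON
    # Attach the case identifier so rows from all visits form one event log.
    return [{"case_id": case_id, **event} for event in events]
```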
Results:
GPT demonstrated reasonable ability to extract nursing interventions from conversational transcripts, especially when activities were discussed explicitly and temporally aligned. Challenges emerged when information was implicit, ambiguous, or not captured in the dialogue. We propose five guidelines for using LLMs in this context, addressing data source limitations, activity label selection, confidence calibration, hallucination handling, and stakeholder-specific output needs. These guidelines provide lessons that extend beyond home care to other domains where conversational data must be translated into structured process insights.
Conclusion:
LLMs show promise for transforming informal clinical dialogue into structured representations of care. While expert oversight and tailored prompts remain essential, future model improvements may enhance reliability. Still, applications in real-world healthcare contexts must be handled with care to ensure accuracy, transparency, and stakeholder trust.
{"title":"Turning Dialogues Into Event Data: Lessons From GPT-Based Recognition of Nursing Actions","authors":"Iris Beerepoot , Sjaak Brinkkemper , Elke Huntink , Berfin Duman , Hajo A. Reijers , Nienke Bleijenberg","doi":"10.1016/j.jbi.2025.104957","DOIUrl":"10.1016/j.jbi.2025.104957","url":null,"abstract":"<div><h3>Objective:</h3><div>To assess the feasibility of using a large language model (LLM) to generate structured event logs from conversational data in home-based nursing care, with the goal of reducing the documentation burden and enabling process analysis.</div></div><div><h3>Methods:</h3><div>We conducted an exploratory study involving 27 audio-recorded home care visits between district nurses and patients. These recordings were transcribed and used as input for a Generative Pre-Trained Transformer (GPT) to identify nursing interventions and construct event logs, using the standardised Nursing Interventions Classification (NIC) system. We applied and evaluated different prompts through an iterative, interdisciplinary process involving computer scientists and nurse researchers.</div></div><div><h3>Results:</h3><div>GPT demonstrated reasonable ability to extract nursing interventions from conversational transcripts, especially when activities were discussed explicitly and temporally aligned. Challenges emerged when information was implicit, ambiguous, or not captured in the dialogue. We propose five guidelines for using LLMs in this context, addressing data source limitations, activity label selection, confidence calibration, hallucination handling, and stakeholder-specific output needs. These guidelines provide lessons that extend beyond home care to other domains where conversational data must be translated into structured process insights.</div></div><div><h3>Conclusion:</h3><div>LLMs show promise for transforming informal clinical dialogue into structured representations of care. While expert oversight and tailored prompts remain essential, future model improvements may enhance reliability. Still, applications in real-world healthcare contexts must be handled with care to ensure accuracy, transparency, and stakeholder trust.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104957"},"PeriodicalIF":4.5,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145517870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature
Pub Date: 2025-11-14 | DOI: 10.1016/j.jbi.2025.104958
Chia-Hsuan Chang , Brian Ondov , Bin Choi , Xueqing Peng , Huan He , Hua Xu
Objective
The rapid expansion of biomedical literature necessitates effective approaches for organizing and interpreting complex research topics. Existing embedding-based topic modeling techniques provide flat clusters at a single granularity, ignoring the hierarchical nature of research subjects. Our objective is instead to create a forest of topic trees, each of which starts from a broad area and drills down to narrow specialties.
Methods
We propose TopicForest, a new embedding-driven hierarchical clustering and labeling framework that involves: (1) embedding biomedical abstracts within a high-dimensional semantic space using contrastively trained LLMs, (2) manifold learning to reduce dimensionality for visual interpretation, (3) hierarchical clustering via binary partitioning and multi-level dendrogram cutting, and (4) recursive LLM-based topic summarization to efficiently generate concise and coherent labels from the smallest clusters up to broad subjects covering thousands of publications. We construct a corpus comprising 24,366 biomedical abstracts from Scientific Reports, leveraging its human-curated topic hierarchy as the gold standard for evaluation. We evaluate clustering performance using Adjusted Mutual Information (AMI) and Dasgupta’s cost, while labeling quality is evaluated based on diversity and hierarchical affinity.
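The sketch below illustrates such an embedding, dimensionality-reduction, and multi-level dendrogram-cutting pipeline with off-the-shelf libraries; the encoder name, UMAP settings, and cut levels are placeholders rather than the authors' configuration.

```python
# Sketch of an embedding -> reduction -> hierarchical-clustering pipeline in the spirit
# of TopicForest; all model names and parameters here are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import umap
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_mutual_info_score

def topic_tree_cuts(abstracts, gold_labels, n_levels=(5, 20, 80)):
    # 1) Embed abstracts in a semantic space (a contrastively trained encoder is assumed).
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)
    # 2) Manifold learning to reduce dimensionality.
    reduced = umap.UMAP(n_components=5, random_state=42).fit_transform(emb)
    # 3) Hierarchical clustering with multi-level dendrogram cutting.
    tree = linkage(reduced, method="ward")
    cuts = {k: fcluster(tree, t=k, criterion="maxclust") for k in n_levels}
    # Evaluate each granularity against a curated topic hierarchy with AMI.
    ami = {k: adjusted_mutual_info_score(gold_labels, cuts[k]) for k in n_levels}
    return cuts, ami
```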
Results
TopicForest’s dendrogram cutting achieves AMI scores comparable to or better than flat embedding-based clustering methods such as BERTopic (with K-means or HDBSCAN) across multiple dimension-reduction strategies (t-SNE and UMAP), while uniquely providing multi-scale topic granularity. It also outperforms the deep hierarchical topic model HyperMiner, yielding higher AMI scores and comparable Dasgupta’s costs. For labeling, the proposed LLM recursive labeling method surpasses both c-TF-IDF and HyperMiner, achieving higher label diversity and hierarchical affinity, while maintaining efficient token usage. Furthermore, TopicForest maintains stable clustering quality across different embedding models, demonstrating robustness and generalizability in hierarchical topic discovery.
Conclusion
Through novel integration of LLMs, dimension reduction, and advanced hierarchical clustering techniques, TopicForest provides effective and interpretable hierarchical topic modeling for biomedical literature, facilitating multi-scale exploration and visualization of literature corpora.
{"title":"TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature","authors":"Chia-Hsuan Chang , Brian Ondov , Bin Choi , Xueqing Peng , Huan He , Hua Xu","doi":"10.1016/j.jbi.2025.104958","DOIUrl":"10.1016/j.jbi.2025.104958","url":null,"abstract":"<div><h3>Objective</h3><div>The rapid expansion of biomedical literature necessitates effective approaches for organizing and interpreting complex research topics. Existing embedding-based topic modeling techniques provide flat clusters at single granularities, which ignores the reality of complex hierarchies of subjects. Our objective is to instead create a forest of topic trees, each of which start from a broad area and drill down to narrow specialties.</div></div><div><h3>Methods</h3><div>We propose TopicForest, a new embedding-driven hierarchical clustering and labeling framework that involves: (1) embedding biomedical abstracts within a high-dimensional semantic space using contrastively trained LLMs, (2) manifold learning to reduce dimensionality for visual interpretation, (3) hierarchical clustering via binary partitioning and multi-level dendrogram cutting, and (4) recursive LLM-based topic summarization to efficiently generate concise and coherent labels from the smallest clusters up to broad subjects covering thousands of publications. We construct a corpus comprising 24,366 biomedical abstracts from Scientific Reports, leveraging its human-curated topic hierarchy as gold-standard for evaluation. We evaluate clustering performance using Adjusted Mutual Information (AMI) and Dasgupta’s cost, while labeling quality is evaluated based on diversity and hierarchical affinity.</div></div><div><h3>Results</h3><div>TopicForest’s dendrogram cutting achieves AMI scores comparable to or better than flat embedding-based clustering methods such as BERTopic (with K-means or HDBSCAN) across multiple dimension-reduction strategies (t-SNE and UMAP), while uniquely providing multi-scale topic granularity. It also outperforms the deep hierarchical topic model HyperMiner, yielding higher AMI scores and comparable Dasgupta’s costs. For labeling, the proposed LLM recursive labeling method surpasses both c-TF-IDF and HyperMiner, achieving higher label diversity and hierarchical affinity, while maintaining efficient token usage. Furthermore, TopicForest maintains stable clustering quality across different embedding models, demonstrating robustness and generalizability in hierarchical topic discovery.</div></div><div><h3>Conclusion</h3><div>Through novel integration of LLMs, dimension reduction, and advanced hierarchical clustering techniques, TopicForest provides effective and interpretable hierarchical topic modeling for biomedical literature, facilitating multi-scale exploration and visualization of literature corpora.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104958"},"PeriodicalIF":4.5,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145534439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A joint learning framework for analyzing data from national geriatric centralized networks: A new toolbox deciphering real-world complexity
Pub Date: 2025-11-11 | DOI: 10.1016/j.jbi.2025.104954
Biyi Shen , Yilin Zhang , Thomas G. Travison , Michelle Shardell , Rozalina G. McCoy , Takumi Saegusa , Jason Falvey , Chixiang Chen
Objective:
We propose JLNet, along with a companion R software package, as a systematic joint learning framework for analyzing data from national geriatric centralized networks, such as Medicare Claims. JLNet addresses key challenges in real-world, large-scale healthcare datasets, including hospital-level clustering and heterogeneity, patient-level variability from high-dimensional covariates, and losses to follow-up, while promoting easy implementation to ultimately support decision-making.
Methods:
JLNet proceeds in three steps: (1) fit a dynamic propensity score model to handle patient loss to follow-up; (2) fit a projection-based regularized regression to identify predictive patient-level features while adjusting for hospital-level confounding; and (3) perform hospital-level clustering using transformed residuals, enabling downstream analyses without sharing raw data. We applied JLNet to Medicare claims data to study post-fracture recovery among older adults with Alzheimer’s disease and related dementias (ADRD) following a hip fracture (2010–2018), and evaluated its performance via numerical experiments.
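A conceptual outline of the three steps is sketched below in Python with scikit-learn estimators; JLNet itself ships as an R package, so the specific models, penalty strength, and residual transformation here are simplified stand-ins rather than the published method.

```python
# Conceptual sketch of the three JLNet-style steps (illustrative only, not the R package).
import pandas as pd
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.cluster import KMeans

def jlnet_like(df: pd.DataFrame, feature_cols, outcome_col, observed_col, hospital_col):
    X = df[feature_cols].to_numpy()
    # Step 1: propensity model for remaining under follow-up -> inverse-probability weights.
    p_obs = LogisticRegression(max_iter=1000).fit(X, df[observed_col]).predict_proba(X)[:, 1]
    kept = df[observed_col] == 1
    obs = df[kept].copy()
    w = 1.0 / p_obs[kept]
    # Step 2: regularized regression to select predictive patient-level features
    # (alpha fixed here; a cross-validated choice would replace it in practice).
    lasso = Lasso(alpha=0.05).fit(obs[feature_cols], obs[outcome_col], sample_weight=w)
    selected = [f for f, c in zip(feature_cols, lasso.coef_) if abs(c) > 1e-8]
    # Step 3: cluster hospitals on their mean residuals (a stand-in for transformed residuals).
    obs["resid"] = obs[outcome_col] - lasso.predict(obs[feature_cols])
    hosp = obs.groupby(hospital_col)["resid"].mean().to_frame()
    hosp["cluster"] = KMeans(n_clusters=3, n_init=10).fit_predict(hosp[["resid"]])
    return selected, hosp
```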
Results:
JLNet identified clinically meaningful patient-level variables (e.g., age, weight loss, peripheral vascular disease) and distinct hospital clusters associated with variation in post-discharge recovery, measured by days at home, among patients with ADRD. Numerical experiments showed that JLNet outperformed existing approaches in variable selection and hospital clustering in settings involving high-dimensional covariates and unmeasured hospital-level confounding.
Discussion and conclusion:
JLNet is a scalable, interpretable framework for analyzing centralized health data. It enhances identification of high-risk subcohorts and hospital clusters, supporting more precise resource allocation and personalized care strategies for high-risk older adults. Findings also inform the design of tailored interventions in real-world settings.
{"title":"A joint learning framework for analyzing data from national geriatric centralized networks: A new toolbox deciphering real-world complexity","authors":"Biyi Shen , Yilin Zhang , Thomas G. Travison , Michelle Shardell , Rozalina G. McCoy , Takumi Saegusa , Jason Falvey , Chixiang Chen","doi":"10.1016/j.jbi.2025.104954","DOIUrl":"10.1016/j.jbi.2025.104954","url":null,"abstract":"<div><h3>Objective:</h3><div>We propose JLNet, along with a companion R software package, as a systematic joint learning framework for analyzing data from national geriatric centralized networks, such as Medicare Claims. JLNet addresses key challenges in real-world, large-scale healthcare datasets, including hospital-level clustering and heterogeneity, patient-level variability from high-dimensional covariates, and losses to follow-up, while promoting easy implementation to ultimately support decision-making.</div></div><div><h3>Methods:</h3><div>JLNet proceeds in three steps: (1) fit a dynamic propensity score model to handle patient loss to follow-up; (2) fit a projection-based regularized regression to identify predictive patient-level features while adjusting for hospital-level confounding; and (3) perform hospital-level clustering using transformed residuals, enabling downstream analyses without sharing raw data. We applied JLNet to Medicare claims data to study post-fracture recovery among older adults with Alzheimer’s disease and related dementias (ADRD) following a hip fracture (2010–2018), and evaluated its performance via numerical experiments.</div></div><div><h3>Results:</h3><div>JLNet identified clinically meaningful patient-level variables (e.g., age, weight loss, peripheral vascular disease, etc.) and distinct hospital clusters associated with variation in post-discharge recovery, measured by days at home, among patients with ADRD. Numerical experiments showed that JLNet outperformed existing approaches in variable selection and hospital clustering in the setting involving high-dimensional covariates and unmeasured hospital-level confounding.</div></div><div><h3>Discussion and conclusion:</h3><div>JLNet is a scalable, interpretable framework for analyzing centralized health data. It enhances identification of high-risk subcohorts and hospital clusters, supporting more precise resource allocation and personalized care strategies for high-risk older adults. Findings also inform the design of tailored interventions in real-world settings.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104954"},"PeriodicalIF":4.5,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145513020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attention-based synthetic data generation for calibration-enhanced survival analysis: A case study for chronic kidney disease using electronic health records
Pub Date: 2025-11-07 | DOI: 10.1016/j.jbi.2025.104928
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Objectives
Access to real-world healthcare data is constrained by privacy regulations and data imbalances, hindering the development of fair and reliable clinical prediction models. Synthetic data offers a potential solution, yet existing methods often fail to maintain calibration or enable subgroup-specific augmentation. This study introduces Masked Clinical Modelling (MCM), an attention-based synthetic data generation framework designed to enhance survival model calibration in both global and stratified analyses.
Methods
MCM uses masked feature reconstruction to learn feature dependencies without explicitly training on survival objectives. It supports both standalone dataset synthesis and conditional data augmentation, enabling the generation of targeted synthetic subcohorts without retraining. Evaluated on a chronic kidney disease (CKD) electronic health record (EHR) dataset, MCM was benchmarked against eight baseline methods, including variational autoencoders, GANs, SMOTE variants, and a recent risk-aware distillation model. Model performance was assessed via calibration loss, Cox model consistency, and Kaplan–Meier fidelity.
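The masked-reconstruction idea can be sketched in PyTorch as below; the dimensions, masking rate, and the single attention layer are illustrative choices, not the MCM architecture.

```python
# Minimal PyTorch sketch of masked feature reconstruction in the spirit of MCM:
# random feature positions are hidden and an attention layer reconstructs them.
import torch
import torch.nn as nn

class MaskedReconstructor(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32):
        super().__init__()
        self.embed = nn.Linear(1, d_model)           # each feature value becomes a token
        self.pos = nn.Parameter(torch.zeros(n_features, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features); mask: same shape, 1 where the value is hidden.
        tokens = self.embed((x * (1 - mask)).unsqueeze(-1)) + self.pos
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.out(attended).squeeze(-1)        # reconstructed feature values

def training_step(model, x, mask_rate=0.3):
    mask = (torch.rand_like(x) < mask_rate).float()
    recon = model(x, mask)
    # Loss only on masked positions, so the model learns dependencies between features.
    return ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
```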
Results
MCM-generated data closely replicated statistical properties of the real dataset, preserved hazard ratios, and matched time-to-event curves with high fidelity. Cox models trained on MCM-augmented data demonstrated improved calibration, reducing overall calibration loss by 15% and subgroup meta-calibration loss by 9% compared to unaugmented data. These improvements held across multiple high-risk subgroups, including those with diabetes, renal dysfunction, and advanced age. Unlike competing methods, MCM achieved this without retraining or outcome-specific tuning.
Conclusions
MCM offers a practical and flexible framework for generating synthetic survival data that improves risk model calibration. By supporting both reproducible dataset synthesis and conditional subgroup augmentation, MCM bridges privacy-preserving data access with calibration-aware learning. This work highlights the role of synthetic data not just as a privacy tool, but as a vehicle for improving equity and reliability in clinical modelling.
{"title":"Attention-based synthetic data generation for calibration-enhanced survival analysis: A case study for chronic kidney disease using electronic health records","authors":"Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm","doi":"10.1016/j.jbi.2025.104928","DOIUrl":"10.1016/j.jbi.2025.104928","url":null,"abstract":"<div><h3>Objectives</h3><div>Access to real-world healthcare data is constrained by privacy regulations and data imbalances, hindering the development of fair and reliable clinical prediction models. Synthetic data offers a potential solution, yet existing methods often fail to maintain calibration or enable subgroup-specific augmentation. This study introduces Masked Clinical Modelling (MCM), an attention-based synthetic data generation framework designed to enhance survival model calibration in both global and stratified analyses.</div></div><div><h3>Methods</h3><div>MCM uses masked feature reconstruction to learn feature dependencies without explicitly training on survival objectives. It supports both standalone dataset synthesis and conditional data augmentation, enabling the generation of targeted synthetic subcohorts without retraining. Evaluated on a chronic kidney disease (CKD) electronic health record (EHR) dataset, MCM was benchmarked against eight baseline methods, including variational autoencoders, GANs, SMOTE variants, and a recent risk-aware distillation model. Model performance was assessed via calibration loss, Cox model consistency, and Kaplan–Meier fidelity.</div></div><div><h3>Results</h3><div>MCM-generated data closely replicated statistical properties of the real dataset, pre- served hazard ratios, and matched time-to-event curves with high fidelity. Cox models trained on MCM-augmented data demonstrated improved calibration, reducing overall calibration loss by 15% and subgroup <em>meta</em>-calibration loss by 9% compared to unaugmented data. These improvements held across multiple high-risk subgroups including those with diabetes, renal dys- function, and advanced age. Unlike competing methods, MCM achieved this without retraining or outcome-specific tuning.</div></div><div><h3>Conclusions</h3><div>MCM offers a practical and flexible framework for generating synthetic survival data that improves risk model calibration. By supporting both reproducible dataset synthesis and conditional subgroup augmentation, MCM bridges privacy-preserving data access with calibration-aware learning. This work highlights the role of synthetic data not just as a privacy tool, but as a vehicle for improving equity and reliability in clinical modelling.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104928"},"PeriodicalIF":4.5,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145476802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-DQR: Large language model-based automated generation of data quality rules for electronic health records
Pub Date: 2025-11-06 | DOI: 10.1016/j.jbi.2025.104951
Shuyang Xie , Hailing Cai , Yaoqin Sun, Xudong Lv
Objective
To develop and evaluate LLM-DQR, an automated approach using large language models to generate electronic health record data quality rules, addressing the limitations of current manual and automated methods that suffer from low efficiency, limited flexibility, and inadequate coverage of complex business logic.
Materials and Methods
We designed a comprehensive pipeline with three core components: (1) standardized input processing integrating database schemas, natural language requirements, and sample data; (2) Chain-of-Thought prompt engineering for guided rule generation; and (3) closed-loop validation with deduplication, sandbox execution, and iterative debugging. The approach was evaluated on two distinct, publicly available datasets: the Paediatric Intensive Care (PIC) dataset and the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. Performance was compared against manual expert construction (expert-DQR) and clinical information model-based generation (CIM-DQR).
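A minimal sketch of the prompt-assembly and sandbox-validation steps is shown below; the prompt text, model name, and the SQLite in-memory sandbox are assumptions used for illustration, not the paper's implementation.

```python
# Sketch: assemble a Chain-of-Thought style prompt from schema, requirement, and sample
# data, then sandbox-execute the generated SQL rule. All specifics are illustrative.
import sqlite3
from openai import OpenAI

client = OpenAI()

def generate_rule(schema_sql: str, requirement: str, sample_rows: str) -> str:
    prompt = (
        "Database schema:\n" + schema_sql +
        "\nSample data:\n" + sample_rows +
        "\nRequirement:\n" + requirement +
        "\nThink step by step about which columns are involved, then output one SQL query "
        "that returns the rows violating the rule."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def sandbox_check(schema_sql: str, rule_sql: str) -> bool:
    # Execute the generated rule against an empty in-memory copy of the schema;
    # syntax or schema errors surface here and would trigger another debugging round.
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)
        con.execute(rule_sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        con.close()
```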
Results
LLM-DQR demonstrated higher performance across all evaluation metrics. The GPT implementation achieved overall coverage rates of 97.1% on the PIC dataset and 99.6% on the MIMIC-IV dataset, outperforming CIM-DQR. Performance was particularly strong for complex dimensions, achieving 100% coverage for Consistency rules on both datasets, whereas CIM-DQR achieved 0%. Construction time was reduced more than 10-fold compared to manual methods. Additionally, on the PIC dataset, LLM-DQR generated 89 extra, expert-validated rules.
Discussion
The stronger performance demonstrates that LLMs can understand complex EHR data patterns and assessment requirements, functioning as data quality analysis assistants with domain knowledge and logical reasoning capabilities.
Conclusion
LLM-DQR provides an efficient, scalable solution for automated data quality rule generation in clinical settings, offering considerable improvements over traditional approaches.
{"title":"LLM-DQR: Large language model-based automated generation of data quality rules for electronic health records","authors":"Shuyang Xie , Hailing Cai , Yaoqin Sun, Xudong Lv","doi":"10.1016/j.jbi.2025.104951","DOIUrl":"10.1016/j.jbi.2025.104951","url":null,"abstract":"<div><h3>Objective</h3><div>To develop and evaluate LLM-DQR, an automated approach using large language models to generate electronic health record data quality rules, addressing the limitations of current manual and automated methods that suffer from low efficiency, limited flexibility, and inadequate coverage of complex business logic.</div></div><div><h3>Materials and Methods</h3><div>We designed a comprehensive pipeline with three core components: (1) standardized input processing integrating database schemas, natural language requirements, and sample data; (2) Chain-of-Thought prompt engineering for guided rule generation; and (3) closed-loop validation with deduplication, sandbox execution, and iterative debugging. The approach was evaluated on two distinct, publicly available datasets: the Paediatric Intensive Care (PIC) dataset and the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. Performance was compared against manual expert construction (expert-DQR) and clinical information model-based generation (CIM-DQR).</div></div><div><h3>Results</h3><div>LLM-DQR demonstrated higher performance across all evaluation metrics. The GPT implementation achieved overall coverage rates of 97.1% on the PIC dataset and 99.6% on the MIMIC-IV dataset, outperforming CIM-DQR. Performance was particularly strong for complex dimensions: achieving 100% coverage for Consistency rules on both datasets, whereas CIM-DQR achieved 0%. Construction time was reduced by over 10-fold compared to manual methods. Additionally, on the PIC dataset, LLM-DQR generated 89 extra, expert-validated rules.</div></div><div><h3>Discussion</h3><div>The stronger performance demonstrates LLMs’ capability to understand complex EHR data patterns and assessment requirements, functioning as data quality analysis assistants with domain knowledge and logical reasoning capabilities.</div></div><div><h3>Conclusion</h3><div>LLM-DQR provides an efficient, scalable solution for automated data quality rule generation in clinical settings, offering considerable improvements over traditional approaches.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104951"},"PeriodicalIF":4.5,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145476965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring multimodal large language models on transthoracic Echocardiogram (TTE) tasks for cardiovascular decision support
Pub Date: 2025-11-01 | DOI: 10.1016/j.jbi.2025.104930
Jianfu Li , Yiming Li , Zenan Sun , Evan Yu , Ahmed M. Abdelhameed , Weiguo Cao , Haifang Li , Jianping He , Pengze Li , Jingna Feng , Yue Yu , Xinyue Hu , Manqi Li , Rakesh Kumar , Yifang Dang , Fang Li , Shahyar M Gharacholou , Cui Tao
Objective
Multimodal large language models (LLMs) offer new potential for enhancing cardiovascular decision support, particularly in interpreting echocardiographic data. This study systematically evaluates and benchmarks foundation models from diverse domains on echocardiogram-based tasks to assess their effectiveness, limitations and potential in clinical cardiovascular applications.
Methods
We curated three cardiovascular imaging datasets—EchoNet-Dynamic, TMED2, and an expert-annotated echocardiogram (TTE) dataset—to evaluate performance on four critical tasks: (1) cardiac function evaluation through ejection fraction (EF) prediction, (2) cardiac view classification, (3) aortic stenosis (AS) severity assessment, and (4) cardiovascular disease classification. We evaluated six multimodal LLMs: EchoClip (cardiovascular-specific), BiomedGPT and LLaVA-Med (medical-domain), and MiniCPM-V 2.6, LLaMA-3-Vision-Alpha, and Gemini-1.5 (general-domain). Models were assessed using zero-shot, few-shot, and fine-tuning strategies, where applicable. Performance was measured using mean absolute error (MAE) and root mean squared error (RMSE) for EF prediction, and accuracy, precision, recall, and F1 score for classification tasks.
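For reference, the reported metrics can be computed with scikit-learn as in the toy example below; the arrays are invented for illustration, not study data.

```python
# Toy illustration of the evaluation metrics: MAE/RMSE for ejection-fraction prediction
# and accuracy / precision / recall / F1 for the classification tasks.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_recall_fscore_support)

ef_true = np.array([55.0, 60.0, 35.0, 45.0])   # ground-truth ejection fractions (%)
ef_pred = np.array([50.0, 62.0, 40.0, 44.0])   # model predictions
mae = mean_absolute_error(ef_true, ef_pred)
rmse = np.sqrt(mean_squared_error(ef_true, ef_pred))

view_true = ["PLAX", "A4C", "A4C", "PSAX"]     # cardiac view labels
view_pred = ["PLAX", "A4C", "PSAX", "PSAX"]
acc = accuracy_score(view_true, view_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    view_true, view_pred, average="weighted", zero_division=0)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} acc={acc:.2f} F1={f1:.2f}")
```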
Results
Domain-specific models such as EchoClip demonstrated the strongest zero-shot performance in EF prediction, achieving an MAE of 10.34. General-domain models showed limited effectiveness without adaptation, with MiniCPM-V 2.6 reporting an MAE of 251.92. Fine-tuning significantly improved outcomes; for example, MiniCPM-V 2.6's MAE decreased to 31.93, and view classification accuracy increased from 20% to 63.05%. In classification tasks, EchoClip achieved F1 scores of 0.2716 for AS severity and 0.4919 for disease classification but exhibited limited performance in view classification (F1 = 0.1457). Few-shot learning yielded modest gains but was generally less effective than fine-tuning.
Conclusions
This evaluation and benchmarking study demonstrated the importance of domain-specific pretraining and model adaptation in cardiovascular decision support tasks. Cardiovascular-focused models and fine-tuned general-domain models achieved superior performance, especially for complex assessments such as EF estimation. These findings offer critical insights into the current capabilities and future directions for clinically meaningful AI integration in cardiovascular medicine.
{"title":"Exploring multimodal large language models on transthoracic Echocardiogram (TTE) tasks for cardiovascular decision support","authors":"Jianfu Li , Yiming Li , Zenan Sun , Evan Yu , Ahmed M. Abdelhameed , Weiguo Cao , Haifang Li , Jianping He , Pengze Li , Jingna Feng , Yue Yu , Xinyue Hu , Manqi Li , Rakesh Kumar , Yifang Dang , Fang Li , Shahyar M Gharacholou , Cui Tao","doi":"10.1016/j.jbi.2025.104930","DOIUrl":"10.1016/j.jbi.2025.104930","url":null,"abstract":"<div><h3>Objective</h3><div>Multimodal large language models (LLMs) offer new potential for enhancing cardiovascular decision support, particularly in interpreting echocardiographic data. This study systematically evaluates and benchmarks foundation models from diverse domains on echocardiogram-based tasks to assess their effectiveness, limitations and potential in clinical cardiovascular applications.</div></div><div><h3>Methods</h3><div>We curated three cardiovascular imaging datasets—EchoNet-Dynamic, TMED2, and an expert-annotated echocardiogram (TTE) dataset—to evaluate performance on four critical tasks: (1) cardiac function evaluation through ejection fraction (EF) prediction, (2) cardiac view classification, (3) aortic stenosis (AS) severity assessment, and (4) cardiovascular disease classification. We evaluated six multimodal LLMs: EchoClip (cardiovascular-specific), BiomedGPT and LLaVA-Med (medical-domain), and MiniCPM-V 2.6, LLaMA-3-Vision-Alpha, and Gemini-1.5 (general-domain). Models were assessed using zero-shot, few-shot, and fine-tuning strategies, where applicable. Performance was measured using mean absolute error (MAE) and root mean squared error (RMSE) for EF prediction, and accuracy, precision, recall, and F1 score for classification tasks.</div></div><div><h3>Results</h3><div>Domain-specific models such as EchoClip demonstrated the strongest zero-shot performance in EF prediction, achieving an MAE of 10.34. General-domain models showed limited effectiveness without adaptation, with MiniCPM-V 2.6 reporting an MAE of 251.92. Fine-tuning significantly improved outcomes; for example, MiniCPM-V 2.6′s MAE decreased to 31.93, and view classification accuracy increased from 20 % to 63.05 %. In classification tasks, EchoClip achieved F1 scores of 0.2716 for AS severity and 0.4919 for disease classification but exhibited limited performance in view classification (F1 = 0.1457). Few-shot learning yielded modest gains but was generally less effective than fine-tuning.</div></div><div><h3>Conclusions</h3><div>This evaluation and benchmarking study demonstrated the importance of domain-specific pretraining and model adaptation in cardiovascular decision support tasks. Cardiovascular-focused models and fine-tuned general-domain models achieved superior performance, especially for complex assessments such as EF estimation. 
These findings offer critical insights into the current capabilities and future directions for clinically meaningful AI integration in cardiovascular medicine.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104930"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145370273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable scientific interest profiling using large language models
Pub Date: 2025-11-01 | DOI: 10.1016/j.jbi.2025.104949
Yilun Liang , Gongbo Zhang , Edward Sun , Betina Idnay , Yilu Fang , Fangyi Chen , Casey Ta , Yifan Peng , Chunhua Weng
Objective
Research profiles highlight scientists’ research focus, enabling talent discovery and fostering collaborations, but they are often outdated. Automated, scalable methods are urgently needed to keep these profiles current.
Methods
In this study, we design and evaluate two Large Language Model (LLM)-based methods to generate scientific interest profiles—one summarizing researchers’ PubMed abstracts and the other generating a summary using their publications’ Medical Subject Headings (MeSH) terms—and compare these machine-generated profiles with researchers’ self-summarized interests. We collected the titles, MeSH terms, and abstracts of PubMed publications for 595 faculty members affiliated with Columbia University Irving Medical Center (CUIMC), for 167 of whom we obtained human-written online research profiles. Subsequently, GPT-4o-mini, a state-of-the-art LLM, was prompted to summarize each researcher’s interests. Both manual and automated evaluations were conducted to characterize the similarities and differences between the machine-generated and self-written research profiles.
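The MeSH-based variant might be wired up roughly as below; the prompt wording, term aggregation, and model name are assumptions for illustration, not the study's exact procedure.

```python
# Sketch of a MeSH-based profiling step: aggregate the most frequent MeSH terms of a
# researcher's publications and ask the model for a short interest summary.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def mesh_profile(publications: list[dict], top_k: int = 25) -> str:
    # publications: [{"title": ..., "mesh_terms": [...]}, ...] for one researcher
    counts = Counter(term for pub in publications for term in pub["mesh_terms"])
    top_terms = ", ".join(term for term, _ in counts.most_common(top_k))
    prompt = (
        "Write a third-person, one-paragraph summary of this researcher's scientific interests "
        f"based on the most frequent MeSH terms of their publications: {top_terms}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content
```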
Results
The similarity study showed low ROUGE-L, BLEU, and METEOR scores, reflecting little overlap between terminologies used in machine-generated and self-written profiles. BERTScore analysis revealed moderate semantic similarity between machine-generated and reference summaries (F1: 0.542 for MeSH-based, 0.555 for abstract-based), despite low lexical overlap. In validation, paraphrased summaries achieved a higher F1 of 0.851. A further comparison between the original and paraphrased manually written summaries indicates the limitations of such metrics. Kullback-Leibler (KL) Divergence of term frequency-inverse document frequency (TF-IDF) values (8.56 and 8.58 for profiles derived from MeSH terms and abstracts, respectively) suggests that machine-generated summaries employ different keywords than human-written summaries. Manual reviews further showed that 77.78% rated the overall impression of MeSH-based profiling as “good” or “excellent,” with readability receiving favorable ratings in 93.44% of cases, though granularity and factual accuracy varied. Overall, panel reviews favored 67.86% of machine-generated profiles derived from MeSH terms over those derived from abstracts.
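One plausible way to compute such a TF-IDF KL comparison between a machine-generated and a human-written profile is sketched below; the smoothing and normalization details are assumptions, since the abstract does not specify them.

```python
# Turn each profile into a normalized TF-IDF term distribution and measure KL divergence.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.stats import entropy

def tfidf_kl(machine_profile: str, human_profile: str) -> float:
    tfidf = TfidfVectorizer().fit([machine_profile, human_profile])
    m, h = tfidf.transform([machine_profile, human_profile]).toarray()
    eps = 1e-12                       # smoothing so terms absent from one profile keep KL finite
    p = (m + eps) / (m + eps).sum()
    q = (h + eps) / (h + eps).sum()
    return float(entropy(p, q))       # scipy's entropy(p, q) computes KL(p || q)
```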
Conclusion
LLMs promise to automate scientific interest profiling at scale. Profiles derived from MeSH terms have better readability than profiles derived from abstracts. Overall, machine-generated summaries differ from human-written ones in their choice of concepts, with the latter introducing more novel ideas.
{"title":"Scalable scientific interest profiling using large language models","authors":"Yilun Liang , Gongbo Zhang , Edward Sun , Betina Idnay , Yilu Fang , Fangyi Chen , Casey Ta , Yifan Peng , Chunhua Weng","doi":"10.1016/j.jbi.2025.104949","DOIUrl":"10.1016/j.jbi.2025.104949","url":null,"abstract":"<div><h3>Objective</h3><div>Research profiles highlight scientists’ research focus, enabling talent discovery and fostering collaborations, but they are often outdated. Automated, scalable methods are urgently needed to keep these profiles current.</div></div><div><h3>Methods</h3><div>In this study, we design and evaluate two Large Language Models (LLMs)-based methods to generate scientific interest profiles—one summarizing researchers’ PubMed abstracts and the other generating a summary using their publications’ Medical Subject Headings (MeSH) terms—and compare these machine-generated profiles with researchers’ self-summarized interests. We collected the titles, MeSH terms, and abstracts of PubMed publications for 595 faculty members affiliated with Columbia University Irving Medical Center (CUIMC), for 167 of whom we obtained human-written online research profiles. Subsequently, GPT-4o-mini, a state-of-the-art LLM, was prompted to summarize each researcher’s interests. Both manual and automated evaluations were conducted to characterize the similarities and differences between the machine-generated and self-written research profiles.</div></div><div><h3>Results</h3><div>The similarity study showed low ROUGE-L, BLEU, and METEOR scores, reflecting little overlap between terminologies used in machine-generated and self-written profiles. BERTScore analysis revealed moderate semantic similarity between machine-generated and reference summaries (F1: 0.542 for MeSH-based, 0.555 for abstract-based), despite low lexical overlap. In validation, paraphrased summaries achieved a higher F1 of 0.851. A further comparison between the original and paraphrased manually written summaries indicates the limitations of such metrics. Kullback-Leibler (KL) Divergence of term frequency-inverse document frequency (TF-IDF) values (8.56 and 8.58 for profiles derived from MeSH terms and abstracts, respectively) suggests that machine-generated summaries employ different keywords than human-written summaries. Manual reviews further showed that 77.78% rated the overall impression of MeSH-based profiling as “good” or “excellent,” with readability receiving favorable ratings in 93.44% of cases, though granularity and factual accuracy varied. Overall, panel reviews favored 67.86% of machine-generated profiles derived from MeSH terms over those derived from abstracts.</div></div><div><h3>Conclusion</h3><div>LLMs promise to automate scientific interest profiling at scale. Profiles derived from MeSH terms have better readability than profiles derived from abstracts. 
Overall, machine-generated summaries differ from human-written ones in their choice of concepts, with the latter initiating more novel ideas.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"172 ","pages":"Article 104949"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145431708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The crisis of biomedical foundation models
Pub Date: 2025-11-01 | DOI: 10.1016/j.jbi.2025.104917
Fei Wang
{"title":"The crisis of biomedical foundation models","authors":"Fei Wang","doi":"10.1016/j.jbi.2025.104917","DOIUrl":"10.1016/j.jbi.2025.104917","url":null,"abstract":"","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104917"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145182104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scaling up biomedical vision-language models: Fine-tuning, instruction tuning, and multi-modal learning
Pub Date: 2025-11-01 | DOI: 10.1016/j.jbi.2025.104946
Cheng Peng , Kai Zhang , Mengxian Lyu , Hongfang Liu , Lichao Sun , Yonghui Wu
Objective
To advance biomedical vision-language model capabilities through scaling up, fine-tuning, and instruction tuning; to develop vision-language models with improved performance in handling long text; to explore strategies for efficiently adapting vision-language models to diverse multi-modal biomedical tasks; and to examine zero-shot learning performance.
Methods
We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the scaled models with our previous BiomedGPT-Base model and with leading models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.
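As a hypothetical illustration of the data-preparation side of instruction tuning, one multi-modal instruction record could be formatted as below; the field names and instruction templates are assumptions, not the BiomedGPT data format.

```python
# Assemble JSONL instruction-tuning records that mix tasks, so the model learns to
# follow heterogeneous instructions rather than a single task format (illustrative only).
import json
import random

TEMPLATES = {
    "image_classification": "What diagnosis does this image show?",
    "vqa": "Answer the question about the image: {question}",
    "captioning": "Describe the findings in this image.",
}

def make_instruction_record(task: str, image_path: str, target: str, question: str = "") -> str:
    instruction = TEMPLATES[task].format(question=question)
    record = {"task": task, "image": image_path, "instruction": instruction, "output": target}
    return json.dumps(record)

examples = [
    make_instruction_record("captioning", "cxr_0001.png", "No acute cardiopulmonary process."),
    make_instruction_record("vqa", "derm_0042.png", "Yes", question="Is the lesion pigmented?"),
]
random.shuffle(examples)  # one JSONL line per training example, shuffled across tasks
```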
Results and Conclusion
The experimental results show that the new models developed in this study outperform our previous BiomedGPT-Base model on 17 of 23 benchmark datasets and achieve state-of-the-art performance on 15 of 23 datasets when compared to previous models reported in the literature. The new models also demonstrated improved ability in handling long text, particularly on text summarization on the MIMIC-III dataset and text understanding on the SEER dataset, with a remarkable improvement of 4.6–11.4 %. Instruction tuning on the scaled models resulted in significant enhancements in zero-shot learning ability and alignment accuracy in following complex instructions across multiple tasks, including image classification, visual question answering, and image captioning. This study develops two vision-language models in the biomedical domain and examines technologies to improve long text content in vision language models through scaling, fine-tuning, and instruction tuning. This study demonstrates the potential of vision language models to integrate multiple data modalities to solve diverse multimodal tasks in the biomedical domain.
{"title":"Scaling up biomedical vision-language models: Fine-tuning, instruction tuning, and multi-modal learning","authors":"Cheng Peng , Kai Zhang , Mengxian Lyu , Hongfang Liu , Lichao Sun , Yonghui Wu","doi":"10.1016/j.jbi.2025.104946","DOIUrl":"10.1016/j.jbi.2025.104946","url":null,"abstract":"<div><h3>Objective</h3><div>To advance biomedical vision language model capabilities through scaling up, fine-tuning, and instruction tuning, develop vision-language models with improved performance in handling long text, explore strategies to efficiently adopt vision language models for diverse multi-modal biomedical tasks, and examine the zero-shot learning performance.</div></div><div><h3>Methods</h3><div>We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder-based transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the developed scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.</div></div><div><h3>Results and Conclusion</h3><div>The experimental results show that the new models developed in this study outperform our previous BiomedGPT-Base model on 17 of 23 benchmark datasets and achieve state-of-the-art performance on 15 of 23 datasets when compared to previous models reported in the literature. The new models also demonstrated improved ability in handling long text, particularly on text summarization on the MIMIC-III dataset and text understanding on the SEER dataset, with a remarkable improvement of 4.6–11.4 %. Instruction tuning on the scaled models resulted in significant enhancements in zero-shot learning ability and alignment accuracy in following complex instructions across multiple tasks, including image classification, visual question answering, and image captioning. This study develops two vision-language models in the biomedical domain and examines technologies to improve long text content in vision language models through scaling, fine-tuning, and instruction tuning. This study demonstrates the potential of vision language models to integrate multiple data modalities to solve diverse multimodal tasks in the biomedical domain.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104946"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145370338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pre-coding skin cancer from free-text pathology reports using noise-robust neural networks
Pub Date: 2025-11-01 | DOI: 10.1016/j.jbi.2025.104943
Tapio Niemi, Gautier Defossez, Simon Germann, Jean-Luc Bulliard
Objective
Population-based cancer registries receive numerous free-text pathology reports from which cancer cases are manually coded according to international standards. Skin cancer is the most frequent cancer in Caucasian populations, and its incidence is increasing. We developed an AI-based method to identify skin cancer, locate relevant key terms in pathological reports, and suggest coding for the main clinical variables.
Methods
We explored multiple neural network architectures and found that convolutional neural networks with customised noise-robust loss functions offer the best performance for identifying cancer types and pre-coding subsite, morphology, behaviour, grade, laterality, and first line of treatment of skin cancer cases. Previously registered cases were used as training data. We additionally applied an attention mechanism to extract and highlight reports’ key diagnostic terms. These highlights facilitate human review of pre-coding results. We evaluated the performance of the method using manually coded cases in a separate test set.
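The abstract does not name the customised noise-robust losses, so the sketch below pairs a standard text CNN with generalized cross entropy as one example of such a loss; vocabulary size, dimensions, and the parameter q are illustrative.

```python
# PyTorch sketch: text CNN classifier plus a noise-robust loss (generalized cross entropy),
# used here only as an example of the class of losses the abstract refers to.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size: int, n_classes: int, emb_dim: int = 128, n_filters: int = 100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, k) for k in (3, 4, 5))
        self.fc = nn.Linear(3 * n_filters, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.emb(token_ids).transpose(1, 2)                  # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))                 # class logits

def generalized_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, q: float = 0.7) -> torch.Tensor:
    # Bounded loss that down-weights confidently wrong (possibly mislabeled) examples.
    p_true = F.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_true.clamp(min=1e-7) ** q) / q).mean()
```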
Results
The accuracies of detecting skin cancer types were 0.98–0.99, and F1 scores 0.93–0.96. Pre-coding accuracy and weighted F1 score were: ICD-O subsite (4 digits): 0.89–0.91 and 0.89–0.91; morphology (4 digits): 0.61–0.90 and 0.63–0.89; morphology (3 digits): 0.86–0.98 and 0.89–0.98; tumour behaviour: 0.96–0.98 and 0.96–0.98; laterality: 0.99 and 0.98–0.99. Grade was additionally assessed for squamous cell carcinoma (SCC) of the skin (accuracy 0.96, weighted F1 score 0.96), and first line of treatment for SCC and melanoma (accuracies 0.84 and 0.87, weighted F1 scores 0.82 and 0.87). The extracted key words matched ICD-O code descriptions with high precision.
Conclusion
We piloted our method in the Vaud Cancer Registry, Switzerland. It was able to identify and pre-code skin cancer cases efficiently and to find correct key terms in reports. Medical coders found pre-coding useful and time-saving. Integration of the method into the registry’s document workflow and its extension to other cancer types are planned.
{"title":"Pre-coding skin cancer from free-text pathology reports using noise-robust neural networks","authors":"Tapio Niemi, Gautier Defossez, Simon Germann, Jean-Luc Bulliard","doi":"10.1016/j.jbi.2025.104943","DOIUrl":"10.1016/j.jbi.2025.104943","url":null,"abstract":"<div><h3>Objective</h3><div>Population-based cancer registries receive numerous free-text pathology reports from which cancer cases are manually coded according to international standards. Skin cancer is the most frequent cancer in Caucasian populations, and its incidence is increasing. We developed an AI-based method to identify skin cancer, locate relevant key terms in pathological reports, and suggest coding for the main clinical variables.</div></div><div><h3>Methods</h3><div>We explored multiple neural network architectures and found out that convolutional neural networks with customised noise-robust loss functions offer the best performance for identifying cancer types and pre-coding subsite, morphology, behaviour, grade, laterality, and first line of treatment of skin cancer cases. Previously registered cases were used as training data. We additionally applied an attention mechanism to extract and highlight reports’ key diagnostic terms. These highlights facilitate human review of pre-coding results. We evaluated performance of the method by using manually coded cases in a separate test set.</div></div><div><h3>Results</h3><div>The accuracies of detecting skin cancer types were 0.98–0.99, and F1 scores 0.93–0.96. Pre-coding accuracy and weighted F1 score were: ICD-O subsite (4 digits): 0.89–0.91 and 0.89–0.91, morphology (4 digits): 0.61–0.90 and 0.63–0.89, morphology (3-digits): 0.86–0.98 and 0.89–0.98, tumour behaviour: 0.96–0.98 and 0.96–0.98, laterality: 0.99 and 0.98–0.99. Also, accuracy (0.96) and weighted F1 score (0.96) for the grade were estimated for squamous cell carcinoma (SCC) of the skin, and treatments for SCC and melanoma (accuracies 0.84 and 0.87, weighted F1 scores and 0.82 and 0.87). The extracted key words matched ICD-O code descriptions with high precision.</div></div><div><h3>Conclusion</h3><div>We piloted our method in the Vaud Cancer Registry, Switzerland. It was able to identify and pre-code skin cancer cases efficiently and find correct key terms in reports. Medical coders found pre-coding useful and time saving. Integration of the method in the registry document workflow and its extension to other cancer types are intended.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104943"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145417992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}