
Latest Articles from the Journal of Biomedical Informatics

Turning Dialogues Into Event Data: Lessons From GPT-Based Recognition of Nursing Actions
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-14 DOI: 10.1016/j.jbi.2025.104957
Iris Beerepoot , Sjaak Brinkkemper , Elke Huntink , Berfin Duman , Hajo A. Reijers , Nienke Bleijenberg

Objective:

To assess the feasibility of using a large language model (LLM) to generate structured event logs from conversational data in home-based nursing care, with the goal of reducing the documentation burden and enabling process analysis.

Methods:

We conducted an exploratory study involving 27 audio-recorded home care visits between district nurses and patients. These recordings were transcribed and used as input for a Generative Pre-Trained Transformer (GPT) to identify nursing interventions and construct event logs, using the standardised Nursing Interventions Classification (NIC) system. We applied and evaluated different prompts through an iterative, interdisciplinary process involving computer scientists and nurse researchers.
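As a rough illustration of the extraction step described above, the sketch below maps a (mocked) model response onto event-log rows and filters out labels that fall outside the intervention taxonomy. The label subset, prompt wording, and JSON response format are assumptions for illustration only, not the authors' actual pipeline.

```python
import json

# Illustrative subset of NIC intervention labels (an assumption, not the study's label set).
NIC_LABELS = ["Medication Management", "Wound Care", "Vital Signs Monitoring"]

def build_prompt(transcript: str) -> str:
    """Compose an extraction prompt asking the model for structured events."""
    return (
        "Identify nursing interventions in the transcript below. "
        f"Use only these NIC labels: {', '.join(NIC_LABELS)}. "
        "Return a JSON list of objects with keys 'activity' and 'evidence'.\n\n"
        + transcript
    )

def parse_event_log(model_output: str, case_id: str) -> list:
    """Turn the model's JSON answer into event-log rows, dropping labels
    outside the taxonomy (a simple guard against hallucinated activities)."""
    events = json.loads(model_output)
    return [
        {"case": case_id, "activity": e["activity"], "evidence": e["evidence"]}
        for e in events
        if e["activity"] in NIC_LABELS
    ]

prompt = build_prompt("Nurse: Let me change the dressing on your leg.")

# A mocked model response stands in for the real LLM call.
mock_response = json.dumps([
    {"activity": "Wound Care", "evidence": "Let me change the dressing on your leg."},
    {"activity": "Gardening", "evidence": "Your roses look lovely."},  # invalid NIC label
])
log = parse_event_log(mock_response, case_id="visit-01")
print(log)  # only the valid 'Wound Care' event survives
```

The taxonomy filter mirrors the paper's concern with hallucination handling: any activity the model invents outside the agreed label set is discarded rather than logged.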

Results:

GPT demonstrated reasonable ability to extract nursing interventions from conversational transcripts, especially when activities were discussed explicitly and temporally aligned. Challenges emerged when information was implicit, ambiguous, or not captured in the dialogue. We propose five guidelines for using LLMs in this context, addressing data source limitations, activity label selection, confidence calibration, hallucination handling, and stakeholder-specific output needs. These guidelines provide lessons that extend beyond home care to other domains where conversational data must be translated into structured process insights.

Conclusion:

LLMs show promise for transforming informal clinical dialogue into structured representations of care. While expert oversight and tailored prompts remain essential, future model improvements may enhance reliability. Still, applications in real-world healthcare contexts must be handled with care to ensure accuracy, transparency, and stakeholder trust.
Citations: 0
TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-14 DOI: 10.1016/j.jbi.2025.104958
Chia-Hsuan Chang , Brian Ondov , Bin Choi , Xueqing Peng , Huan He , Hua Xu

Objective

The rapid expansion of biomedical literature necessitates effective approaches for organizing and interpreting complex research topics. Existing embedding-based topic modeling techniques provide flat clusters at a single granularity, ignoring the reality that subjects form complex hierarchies. Our objective is instead to create a forest of topic trees, each of which starts from a broad area and drills down to narrow specialties.

Methods

We propose TopicForest, a new embedding-driven hierarchical clustering and labeling framework that involves: (1) embedding biomedical abstracts within a high-dimensional semantic space using contrastively trained LLMs, (2) manifold learning to reduce dimensionality for visual interpretation, (3) hierarchical clustering via binary partitioning and multi-level dendrogram cutting, and (4) recursive LLM-based topic summarization to efficiently generate concise and coherent labels from the smallest clusters up to broad subjects covering thousands of publications. We construct a corpus comprising 24,366 biomedical abstracts from Scientific Reports, leveraging its human-curated topic hierarchy as the gold standard for evaluation. We evaluate clustering performance using Adjusted Mutual Information (AMI) and Dasgupta’s cost, while labeling quality is evaluated based on diversity and hierarchical affinity.
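The binary-partitioning and multi-level dendrogram-cutting idea in step (3) can be illustrated with a toy one-dimensional stand-in for the reduced embeddings. TopicForest's actual pipeline operates on LLM embeddings after manifold learning; this sketch only shows the tree-building and cutting mechanics.

```python
def bisect(points):
    """Split 1-D 'embedding' values into two halves at their midpoint."""
    pivot = (min(points) + max(points)) / 2
    left = [p for p in points if p <= pivot]
    right = [p for p in points if p > pivot]
    return left, right

def build_tree(points, depth=0, max_depth=2):
    """Recursive binary partitioning: each node holds its points and two children."""
    if depth == max_depth or len(points) < 2:
        return {"points": sorted(points), "children": []}
    left, right = bisect(points)
    if not left or not right:  # degenerate split: stop here
        return {"points": sorted(points), "children": []}
    return {
        "points": sorted(points),
        "children": [build_tree(left, depth + 1, max_depth),
                     build_tree(right, depth + 1, max_depth)],
    }

def cut_at_depth(tree, depth):
    """Multi-level 'dendrogram cutting': clusters found at a given depth."""
    if depth == 0 or not tree["children"]:
        return [tree["points"]]
    return [c for child in tree["children"] for c in cut_at_depth(child, depth - 1)]

tree = build_tree([0.1, 0.2, 0.9, 1.0, 5.0, 5.2])
print(cut_at_depth(tree, 1))  # coarse cut: two broad clusters
print(cut_at_depth(tree, 2))  # deeper cut: finer topic granularity
```

Cutting the same tree at different depths is what yields the multi-scale granularity the paper describes: a shallow cut gives broad subject areas, a deep cut gives narrow specialties.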

Results

TopicForest’s dendrogram cutting achieves AMI scores comparable to or better than flat embedding-based clustering methods such as BERTopic (with K-means or HDBSCAN) across multiple dimension-reduction strategies (t-SNE and UMAP), while uniquely providing multi-scale topic granularity. It also outperforms the deep hierarchical topic model HyperMiner, yielding higher AMI scores and comparable Dasgupta’s costs. For labeling, the proposed LLM recursive labeling method surpasses both c-TF-IDF and HyperMiner, achieving higher label diversity and hierarchical affinity, while maintaining efficient token usage. Furthermore, TopicForest maintains stable clustering quality across different embedding models, demonstrating robustness and generalizability in hierarchical topic discovery.

Conclusion

Through novel integration of LLMs, dimension reduction, and advanced hierarchical clustering techniques, TopicForest provides effective and interpretable hierarchical topic modeling for biomedical literature, facilitating multi-scale exploration and visualization of literature corpora.
Citations: 0
A joint learning framework for analyzing data from national geriatric centralized networks: A new toolbox deciphering real-world complexity
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-11 DOI: 10.1016/j.jbi.2025.104954
Biyi Shen , Yilin Zhang , Thomas G. Travison , Michelle Shardell , Rozalina G. McCoy , Takumi Saegusa , Jason Falvey , Chixiang Chen

Objective:

We propose JLNet, along with a companion R software package, as a systematic joint learning framework for analyzing data from national geriatric centralized networks, such as Medicare Claims. JLNet addresses key challenges in real-world, large-scale healthcare datasets, including hospital-level clustering and heterogeneity, patient-level variability from high-dimensional covariates, and losses to follow-up, while promoting easy implementation to ultimately support decision-making.

Methods:

JLNet proceeds in three steps: (1) fit a dynamic propensity score model to handle patient loss to follow-up; (2) fit a projection-based regularized regression to identify predictive patient-level features while adjusting for hospital-level confounding; and (3) perform hospital-level clustering using transformed residuals, enabling downstream analyses without sharing raw data. We applied JLNet to Medicare claims data to study post-fracture recovery among older adults with Alzheimer’s disease and related dementias (ADRD) following a hip fracture (2010–2018), and evaluated its performance via numerical experiments.
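A deliberately simplified version of step (3), clustering hospitals from patient-level residuals without sharing raw data, might look like the following. The mean-residual aggregation and tolerance threshold are simplifying assumptions for illustration, not JLNet's actual transformed-residual procedure.

```python
from collections import defaultdict

def hospital_residual_profiles(records):
    """Aggregate patient-level residuals (outcome minus prediction) per hospital.
    Only these summaries, not raw patient rows, are needed downstream."""
    sums = defaultdict(lambda: [0.0, 0])
    for hospital, residual in records:
        sums[hospital][0] += residual
        sums[hospital][1] += 1
    return {h: s / n for h, (s, n) in sums.items()}

def cluster_hospitals(profiles, tol=1.0):
    """Greedy 1-D clustering: hospitals whose mean residuals lie within
    `tol` of the previous hospital's mean share a cluster."""
    clusters = []
    for h, mean in sorted(profiles.items(), key=lambda kv: kv[1]):
        if clusters and abs(mean - clusters[-1][-1][1]) <= tol:
            clusters[-1].append((h, mean))
        else:
            clusters.append([(h, mean)])
    return [[h for h, _ in c] for c in clusters]

# (hospital, residual) pairs: hospitals A/B recover better than predicted, C worse.
records = [("A", 2.0), ("A", 3.0), ("B", 2.5), ("B", 3.5), ("C", -4.0), ("C", -5.0)]
profiles = hospital_residual_profiles(records)
print(cluster_hospitals(profiles))  # C separates from the A/B group
```

The key design point carried over from the paper is that clustering operates on residual summaries after patient-level effects are modeled out, so hospital groups reflect hospital-level heterogeneity rather than case mix.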

Results:

JLNet identified clinically meaningful patient-level variables (e.g., age, weight loss, peripheral vascular disease, etc.) and distinct hospital clusters associated with variation in post-discharge recovery, measured by days at home, among patients with ADRD. Numerical experiments showed that JLNet outperformed existing approaches in variable selection and hospital clustering in the setting involving high-dimensional covariates and unmeasured hospital-level confounding.

Discussion and conclusion:

JLNet is a scalable, interpretable framework for analyzing centralized health data. It enhances identification of high-risk subcohorts and hospital clusters, supporting more precise resource allocation and personalized care strategies for high-risk older adults. Findings also inform the design of tailored interventions in real-world settings.
Citations: 0
Attention-based synthetic data generation for calibration-enhanced survival analysis: A case study for chronic kidney disease using electronic health records
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-07 DOI: 10.1016/j.jbi.2025.104928
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

Objectives

Access to real-world healthcare data is constrained by privacy regulations and data imbalances, hindering the development of fair and reliable clinical prediction models. Synthetic data offers a potential solution, yet existing methods often fail to maintain calibration or enable subgroup-specific augmentation. This study introduces Masked Clinical Modelling (MCM), an attention-based synthetic data generation framework designed to enhance survival model calibration in both global and stratified analyses.

Methods

MCM uses masked feature reconstruction to learn feature dependencies without explicitly training on survival objectives. It supports both standalone dataset synthesis and conditional data augmentation, enabling the generation of targeted synthetic subcohorts without retraining. Evaluated on a chronic kidney disease (CKD) electronic health record (EHR) dataset, MCM was benchmarked against eight baseline methods, including variational autoencoders, GANs, SMOTE variants, and a recent risk-aware distillation model. Model performance was assessed via calibration loss, Cox model consistency, and Kaplan–Meier fidelity.
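The masked-reconstruction objective can be illustrated with a deliberately trivial "model" (column-mean imputation). MCM itself is attention-based; this sketch only shows the masking setup and the loss computed on the hidden entries, which is the training signal the Methods describe.

```python
import random

def mask_features(rows, mask_rate=0.3, seed=0):
    """Randomly hide entries (set to None), recording their positions and true values."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if rng.random() < mask_rate:
                targets[(i, j)] = v
                new_row[j] = None  # hidden; the rest of the row stays visible as context
        masked.append(new_row)
    return masked, targets

def reconstruct(masked):
    """Stand-in 'model': fill each hidden entry with its column mean over visible entries."""
    n_cols = len(masked[0])
    col_means = []
    for j in range(n_cols):
        vals = [r[j] for r in masked if r[j] is not None]
        col_means.append(sum(vals) / len(vals))
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in masked]

def masked_loss(recon, targets):
    """Mean squared error computed only on the entries that were masked."""
    errs = [(recon[i][j] - t) ** 2 for (i, j), t in targets.items()]
    return sum(errs) / len(errs)

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
masked, targets = mask_features(rows)
recon = reconstruct(masked)
print(masked_loss(recon, targets))
```

Because the loss is restricted to masked positions, the learner must infer each hidden feature from its co-occurring features, which is how MCM captures feature dependencies without ever training on the survival outcome itself.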

Results

MCM-generated data closely replicated statistical properties of the real dataset, preserved hazard ratios, and matched time-to-event curves with high fidelity. Cox models trained on MCM-augmented data demonstrated improved calibration, reducing overall calibration loss by 15% and subgroup meta-calibration loss by 9% compared to unaugmented data. These improvements held across multiple high-risk subgroups, including those with diabetes, renal dysfunction, and advanced age. Unlike competing methods, MCM achieved this without retraining or outcome-specific tuning.

Conclusions

MCM offers a practical and flexible framework for generating synthetic survival data that improves risk model calibration. By supporting both reproducible dataset synthesis and conditional subgroup augmentation, MCM bridges privacy-preserving data access with calibration-aware learning. This work highlights the role of synthetic data not just as a privacy tool, but as a vehicle for improving equity and reliability in clinical modelling.
Citations: 0
LLM-DQR: Large language model-based automated generation of data quality rules for electronic health records
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-06 DOI: 10.1016/j.jbi.2025.104951
Shuyang Xie , Hailing Cai , Yaoqin Sun, Xudong Lv

Objective

To develop and evaluate LLM-DQR, an automated approach using large language models to generate electronic health record data quality rules, addressing the limitations of current manual and automated methods that suffer from low efficiency, limited flexibility, and inadequate coverage of complex business logic.

Materials and Methods

We designed a comprehensive pipeline with three core components: (1) standardized input processing integrating database schemas, natural language requirements, and sample data; (2) Chain-of-Thought prompt engineering for guided rule generation; and (3) closed-loop validation with deduplication, sandbox execution, and iterative debugging. The approach was evaluated on two distinct, publicly available datasets: the Paediatric Intensive Care (PIC) dataset and the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. Performance was compared against manual expert construction (expert-DQR) and clinical information model-based generation (CIM-DQR).
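The closed-loop validation component (deduplication, sandbox execution, and a debugging queue) might be sketched as below. The rule format and example checks are hypothetical, and a production sandbox would need far stronger isolation than a restricted `eval`.

```python
def dedup(rules):
    """Drop rules whose whitespace-normalized check text is identical."""
    seen, unique = set(), []
    for r in rules:
        key = " ".join(r["check"].split())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def sandbox_validate(rules, sample_rows):
    """Execute each generated rule against sample data; rules that raise are
    routed to an iterative-debugging queue instead of being deployed."""
    valid, needs_debug = [], []
    for rule in rules:
        try:
            for row in sample_rows:
                eval(rule["check"], {"__builtins__": {}}, {"row": row})
            valid.append(rule)
        except Exception:
            needs_debug.append(rule)
    return valid, needs_debug

rules = [
    {"name": "hr_range", "check": "0 < row['heart_rate'] < 300"},
    {"name": "hr_range_dup", "check": "0 < row['heart_rate']  < 300"},  # duplicate text
    {"name": "broken", "check": "row['missing_column'] > 0"},           # raises KeyError
]
sample_rows = [{"heart_rate": 72}, {"heart_rate": 135}]
valid, needs_debug = sandbox_validate(dedup(rules), sample_rows)
print([r["name"] for r in valid], [r["name"] for r in needs_debug])
```

Feeding the `needs_debug` queue back into the LLM with the captured error is the "iterative debugging" part of the loop; only rules that survive sandbox execution reach deployment.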

Results

LLM-DQR demonstrated higher performance across all evaluation metrics. The GPT implementation achieved overall coverage rates of 97.1% on the PIC dataset and 99.6% on the MIMIC-IV dataset, outperforming CIM-DQR. Performance was particularly strong for complex dimensions: LLM-DQR achieved 100% coverage for Consistency rules on both datasets, whereas CIM-DQR achieved 0%. Construction time was reduced more than 10-fold compared to manual methods. Additionally, on the PIC dataset, LLM-DQR generated 89 extra, expert-validated rules.

Discussion

The stronger performance demonstrates LLMs’ capability to understand complex EHR data patterns and assessment requirements, functioning as data quality analysis assistants with domain knowledge and logical reasoning capabilities.

Conclusion

LLM-DQR provides an efficient, scalable solution for automated data quality rule generation in clinical settings, offering considerable improvements over traditional approaches.
Citations: 0
Exploring multimodal large language models on transthoracic Echocardiogram (TTE) tasks for cardiovascular decision support
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 DOI: 10.1016/j.jbi.2025.104930
Jianfu Li , Yiming Li , Zenan Sun , Evan Yu , Ahmed M. Abdelhameed , Weiguo Cao , Haifang Li , Jianping He , Pengze Li , Jingna Feng , Yue Yu , Xinyue Hu , Manqi Li , Rakesh Kumar , Yifang Dang , Fang Li , Shahyar M Gharacholou , Cui Tao

Objective

Multimodal large language models (LLMs) offer new potential for enhancing cardiovascular decision support, particularly in interpreting echocardiographic data. This study systematically evaluates and benchmarks foundation models from diverse domains on echocardiogram-based tasks to assess their effectiveness, limitations and potential in clinical cardiovascular applications.

Methods

We curated three cardiovascular imaging datasets—EchoNet-Dynamic, TMED2, and an expert-annotated echocardiogram (TTE) dataset—to evaluate performance on four critical tasks: (1) cardiac function evaluation through ejection fraction (EF) prediction, (2) cardiac view classification, (3) aortic stenosis (AS) severity assessment, and (4) cardiovascular disease classification. We evaluated six multimodal LLMs: EchoClip (cardiovascular-specific), BiomedGPT and LLaVA-Med (medical-domain), and MiniCPM-V 2.6, LLaMA-3-Vision-Alpha, and Gemini-1.5 (general-domain). Models were assessed using zero-shot, few-shot, and fine-tuning strategies, where applicable. Performance was measured using mean absolute error (MAE) and root mean squared error (RMSE) for EF prediction, and accuracy, precision, recall, and F1 score for classification tasks.
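The regression metrics used for EF prediction are standard; a minimal sketch with hypothetical ejection-fraction values (the numbers below are illustrative, not from the study):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, as used for ejection-fraction (EF) prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large EF misses more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical EF values (%) for four echocardiogram clips.
y_true = [55.0, 60.0, 35.0, 45.0]
y_pred = [50.0, 65.0, 40.0, 45.0]
print(mae(y_true, y_pred))   # 3.75
print(rmse(y_true, y_pred))  # ~4.33
```

Because RMSE squares each error, a model that is usually close but occasionally far off scores worse on RMSE than MAE, which is why the study reports both.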

Results

Domain-specific models such as EchoClip demonstrated the strongest zero-shot performance in EF prediction, achieving an MAE of 10.34. General-domain models showed limited effectiveness without adaptation, with MiniCPM-V 2.6 reporting an MAE of 251.92. Fine-tuning significantly improved outcomes; for example, MiniCPM-V 2.6's MAE decreased to 31.93, and view classification accuracy increased from 20% to 63.05%. In classification tasks, EchoClip achieved F1 scores of 0.2716 for AS severity and 0.4919 for disease classification but exhibited limited performance in view classification (F1 = 0.1457). Few-shot learning yielded modest gains but was generally less effective than fine-tuning.

Conclusions

This evaluation and benchmarking study demonstrated the importance of domain-specific pretraining and model adaptation in cardiovascular decision support tasks. Cardiovascular-focused models and fine-tuned general-domain models achieved superior performance, especially for complex assessments such as EF estimation. These findings offer critical insights into the current capabilities and future directions for clinically meaningful AI integration in cardiovascular medicine.
Citations: 0
Scalable scientific interest profiling using large language models 使用大型语言模型的可伸缩科学兴趣分析。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 DOI: 10.1016/j.jbi.2025.104949
Yilun Liang , Gongbo Zhang , Edward Sun , Betina Idnay , Yilu Fang , Fangyi Chen , Casey Ta , Yifan Peng , Chunhua Weng

Objective

Research profiles highlight scientists’ research focus, enabling talent discovery and fostering collaborations, but they are often outdated. Automated, scalable methods are urgently needed to keep these profiles current.

Methods

In this study, we design and evaluate two Large Language Models (LLMs)-based methods to generate scientific interest profiles—one summarizing researchers’ PubMed abstracts and the other generating a summary using their publications’ Medical Subject Headings (MeSH) terms—and compare these machine-generated profiles with researchers’ self-summarized interests. We collected the titles, MeSH terms, and abstracts of PubMed publications for 595 faculty members affiliated with Columbia University Irving Medical Center (CUIMC), for 167 of whom we obtained human-written online research profiles. Subsequently, GPT-4o-mini, a state-of-the-art LLM, was prompted to summarize each researcher’s interests. Both manual and automated evaluations were conducted to characterize the similarities and differences between the machine-generated and self-written research profiles.
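The summarization step amounts to assembling a prompt from either a researcher's abstracts or the MeSH terms of their publications and sending it to the LLM. The template wording and researcher below are hypothetical — the abstract does not reproduce the study's actual prompts:

```python
def build_profile_prompt(name, abstracts=None, mesh_terms=None):
    """Assemble a research-interest summarization prompt from either a
    researcher's PubMed abstracts or the MeSH terms of their publications.
    Hypothetical template; not the study's actual prompt wording."""
    if abstracts:
        evidence = "\n\n".join(abstracts)
        source = "the following abstracts of their publications"
    else:
        # Deduplicate and sort MeSH terms for a stable, compact prompt
        evidence = ", ".join(sorted(set(mesh_terms)))
        source = "the following MeSH terms drawn from their publications"
    return (
        f"Summarize the scientific research interests of {name} "
        f"in one short paragraph, based on {source}:\n\n{evidence}"
    )

prompt = build_profile_prompt(
    "Dr. Example",  # hypothetical researcher name
    mesh_terms=["Natural Language Processing", "Electronic Health Records",
                "Natural Language Processing"],
)
print(prompt)
```

The resulting string would then be passed as the user message of a single chat-completion call to the model.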

Results

The similarity study showed low ROUGE-L, BLEU, and METEOR scores, reflecting little overlap between terminologies used in machine-generated and self-written profiles. BERTScore analysis revealed moderate semantic similarity between machine-generated and reference summaries (F1: 0.542 for MeSH-based, 0.555 for abstract-based), despite low lexical overlap. In validation, paraphrased summaries achieved a higher F1 of 0.851. A further comparison between the original and paraphrased manually written summaries indicates the limitations of such metrics. Kullback-Leibler (KL) Divergence of term frequency-inverse document frequency (TF-IDF) values (8.56 and 8.58 for profiles derived from MeSH terms and abstracts, respectively) suggests that machine-generated summaries employ different keywords than human-written summaries. Manual reviews further showed that 77.78% rated the overall impression of MeSH-based profiling as “good” or “excellent,” with readability receiving favorable ratings in 93.44% of cases, though granularity and factual accuracy varied. Overall, panel reviews favored 67.86% of machine-generated profiles derived from MeSH terms over those derived from abstracts.
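As a rough illustration of the TF-IDF/KL comparison — the study's exact weighting and smoothing choices are not specified in the abstract — one can turn each summary's TF-IDF weights into a probability distribution and measure their divergence:

```python
import math
from collections import Counter

def tfidf_distribution(doc_tokens, corpus):
    """TF-IDF weights for one tokenized document against a small corpus,
    normalized into a probability distribution over its terms."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        weights[term] = count * idf
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over the union vocabulary, with epsilon smoothing so
    terms absent from one summary do not yield infinite divergence."""
    vocab = set(p) | set(q)
    return sum(
        p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps)) for t in vocab
    )

# Toy token lists standing in for a machine-generated and a human-written profile
machine = "nlp clinical text summarization".split()
human = "clinical informatics decision support".split()
corpus = [machine, human]
p = tfidf_distribution(machine, corpus)
q = tfidf_distribution(human, corpus)
print(round(kl_divergence(p, q), 3))
```

A larger divergence, as with the reported values of 8.56 and 8.58, indicates that the two summary populations emphasize different vocabulary.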

Conclusion

LLMs promise to automate scientific interest profiling at scale. Profiles derived from MeSH terms have better readability than profiles derived from abstracts. Overall, machine-generated summaries differ from human-written ones in their choice of concepts, with the latter initiating more novel ideas.
Citations: 0
The crisis of biomedical foundation models 生物医学基础模型的危机。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 DOI: 10.1016/j.jbi.2025.104917
Fei Wang
Citations: 0
Scaling up biomedical vision-language models: Fine-tuning, instruction tuning, and multi-modal learning 扩大生物医学视觉语言模型:微调,指令调整和多模态学习。
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 DOI: 10.1016/j.jbi.2025.104946
Cheng Peng , Kai Zhang , Mengxian Lyu , Hongfang Liu , Lichao Sun , Yonghui Wu

Objective

To advance biomedical vision-language model capabilities through scaling up, fine-tuning, and instruction tuning; to develop vision-language models with improved performance in handling long text; to explore strategies for efficiently adapting vision-language models to diverse multi-modal biomedical tasks; and to examine zero-shot learning performance.

Methods

We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder-based transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks, including one image-only task (image classification), three language-only tasks (text understanding, text summarization, and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the developed scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy.
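Instruction tuning pairs a natural-language instruction with an optional image and a target response. A minimal, hypothetical record layout for such a corpus — the paper's actual dataset schema is not given in the abstract — might look like:

```python
def make_instruction_record(instruction, response, task, image_path=None):
    """One training example in a generic multi-modal instruction-tuning
    layout. Hypothetical schema for illustration only."""
    record = {
        "task": task,
        "instruction": instruction,
        "response": response,
    }
    if image_path is not None:
        record["image"] = image_path  # omitted for language-only tasks
    return record

record = make_instruction_record(
    instruction="Describe the findings in this echocardiogram view.",
    response="Apical four-chamber view with normal left ventricular size.",
    task="image_captioning",
    image_path="echo_0001.png",  # hypothetical file name
)
print(record["task"])
```

Mixing records of this shape across all six task types is what lets a single encoder-decoder model be instruction-tuned for classification, captioning, and question answering at once.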

Results and Conclusion

The experimental results show that the new models developed in this study outperform our previous BiomedGPT-Base model on 17 of 23 benchmark datasets and achieve state-of-the-art performance on 15 of 23 datasets when compared to previous models reported in the literature. The new models also demonstrated improved ability in handling long text, particularly on text summarization on the MIMIC-III dataset and text understanding on the SEER dataset, with a remarkable improvement of 4.6–11.4 %. Instruction tuning on the scaled models resulted in significant enhancements in zero-shot learning ability and alignment accuracy in following complex instructions across multiple tasks, including image classification, visual question answering, and image captioning. This study develops two vision-language models in the biomedical domain and examines technologies to improve long text content in vision language models through scaling, fine-tuning, and instruction tuning. This study demonstrates the potential of vision language models to integrate multiple data modalities to solve diverse multimodal tasks in the biomedical domain.
Citations: 0
Pre-coding skin cancer from free-text pathology reports using noise-robust neural networks 使用噪声鲁棒神经网络从自由文本病理报告中预编码皮肤癌
IF 4.5 2区 医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-11-01 DOI: 10.1016/j.jbi.2025.104943
Tapio Niemi, Gautier Defossez, Simon Germann, Jean-Luc Bulliard

Objective

Population-based cancer registries receive numerous free-text pathology reports from which cancer cases are manually coded according to international standards. Skin cancer is the most frequent cancer in Caucasian populations, and its incidence is increasing. We developed an AI-based method to identify skin cancer, locate relevant key terms in pathological reports, and suggest coding for the main clinical variables.

Methods

We explored multiple neural network architectures and found that convolutional neural networks with customised noise-robust loss functions offer the best performance for identifying cancer types and pre-coding subsite, morphology, behaviour, grade, laterality, and first line of treatment of skin cancer cases. Previously registered cases were used as training data. We additionally applied an attention mechanism to extract and highlight reports’ key diagnostic terms. These highlights facilitate human review of pre-coding results. We evaluated performance of the method by using manually coded cases in a separate test set.
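The abstract does not detail the customised loss functions. One widely used noise-robust alternative to standard cross-entropy that such a setup could plug in is the generalized cross-entropy of Zhang & Sabuncu (2018), sketched here as an example, not as the authors' actual loss:

```python
def generalized_cross_entropy(probs, target_idx, q=0.7):
    """Generalized cross-entropy L_q = (1 - p_y^q) / q for one example,
    where p_y is the predicted probability of the true class. As q -> 0
    this recovers cross-entropy; q = 1 gives MAE-like behaviour, which is
    more tolerant of mislabeled training examples."""
    p_y = probs[target_idx]
    return (1.0 - p_y ** q) / q

# A confident correct prediction is penalized lightly ...
low = generalized_cross_entropy([0.90, 0.05, 0.05], 0)
# ... while a confident miss (possibly a noisy label) is penalized, but the
# loss saturates at 1/q instead of growing without bound like -log(p).
high = generalized_cross_entropy([0.01, 0.95, 0.04], 0)
print(low, high)
```

The bounded penalty is what keeps a few wrongly coded training cases from dominating the gradient during CNN training.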

Results

The accuracies of detecting skin cancer types were 0.98–0.99, and F1 scores 0.93–0.96. Pre-coding accuracy and weighted F1 score were: ICD-O subsite (4 digits): 0.89–0.91 and 0.89–0.91, morphology (4 digits): 0.61–0.90 and 0.63–0.89, morphology (3-digits): 0.86–0.98 and 0.89–0.98, tumour behaviour: 0.96–0.98 and 0.96–0.98, laterality: 0.99 and 0.98–0.99. Also, accuracy (0.96) and weighted F1 score (0.96) for the grade were estimated for squamous cell carcinoma (SCC) of the skin, and treatments for SCC and melanoma (accuracies 0.84 and 0.87, weighted F1 scores and 0.82 and 0.87). The extracted key words matched ICD-O code descriptions with high precision.

Conclusion

We piloted our method in the Vaud Cancer Registry, Switzerland. It was able to identify and pre-code skin cancer cases efficiently and find correct key terms in reports. Medical coders found pre-coding useful and time saving. Integration of the method in the registry document workflow and its extension to other cancer types are intended.
Citations: 0