Journal of Biomedical Informatics: Latest Articles

TransDiffECG: Semantically controllable ECG synthesis via transformer-based diffusion modeling
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-10-27 | DOI: 10.1016/j.jbi.2025.104948
Yuxin Lin, Jing Ma, Suyu Dong, Chaoyu Sun, Wanting Cong, Kuanquan Wang, Gongning Luo, Wei Wang

Objective:

Existing generative models for electrocardiogram (ECG) synthesis often lack fine-grained, interpretable control, limiting their utility for addressing data scarcity and imbalance. This study aims to develop a model capable of producing diverse and semantically controllable synthetic ECGs to fill this critical gap.

Methods:

We propose TransDiffECG, a novel Transformer-based diffusion model that integrates semantic information injection and global temporal modeling to enable fine-grained control over ECG synthesis. The model allows user-controllable generation of ECG signals with customized physiological details. We establish a comprehensive evaluation protocol, including downstream segmentation and classification tasks, to rigorously assess the authenticity and utility of the generated signals. Extensive experiments are conducted on both single-lead (QTDB) and multi-lead (LUDB) ECG datasets.
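
To make the diffusion component concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to a toy waveform. It is not the TransDiffECG implementation: the linear beta schedule, the step count, the sine-wave stand-in for an ECG lead, and the names `make_alpha_bar` and `q_sample` are all assumptions made for illustration, and the semantic conditioning and Transformer denoiser described above are omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = np.sin(np.linspace(0.0, 4.0 * np.pi, 256))  # toy stand-in for one ECG lead
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still close to the clean signal
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # dominated by Gaussian noise
```

A denoising network would be trained to predict `noise` from `x_t`, the step index, and whatever conditioning signal is injected; generation then runs the learned reverse process from noise.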

Results:

TransDiffECG significantly outperforms state-of-the-art baselines. On the multi-lead LUDB dataset, it achieved superior signal quality (MMD: 3.21×10⁻²; Pearson Correlation: 0.6177). The utility of the synthetic data was confirmed in downstream tasks, where data augmentation improved atrial fibrillation classification to an AUROC of 0.9451. Moreover, a segmentation model trained solely on our synthetic data rivaled one trained on real data (e.g., ∼98% precision/recall on QTDB).
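
For readers unfamiliar with the two signal-quality metrics quoted above, here is a small sketch of how maximum mean discrepancy (MMD) and Pearson correlation can be computed. The abstract does not state which kernel or estimator the authors used, so the RBF kernel, the biased estimator, the `gamma` value, and the random toy data are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd_squared(real, synth, gamma=0.1):
    """Biased estimate of squared MMD between two sets of signals (one per row)."""
    k_rr = rbf_kernel(real, real, gamma).mean()
    k_ss = rbf_kernel(synth, synth, gamma).mean()
    k_rs = rbf_kernel(real, synth, gamma).mean()
    return k_rr + k_ss - 2.0 * k_rs

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
real = rng.standard_normal((64, 8))                   # toy "real" signals
close = real + 0.05 * rng.standard_normal((64, 8))    # near-copy of the real set
far = rng.standard_normal((64, 8)) + 3.0              # clearly shifted distribution
```

Lower MMD means the synthetic distribution is closer to the real one, so `mmd_squared(real, close)` comes out smaller than `mmd_squared(real, far)`; Pearson correlation, by contrast, compares individual waveforms and approaches 1 for near-identical shapes.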

Conclusion:

TransDiffECG represents a significant advancement in synthetic medical signal generation by bridging the gap between clinical interpretability and generative flexibility. Its ability to generate semantically controllable and clinically valid ECGs greatly expands the application potential of generative models in healthcare research and practice.
Journal of Biomedical Informatics, Volume 172, Article 104948 (Journal Article), published 2025-12-01.
Citations: 0
LLM-DQR: Large language model-based automated generation of data quality rules for electronic health records
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-06 | DOI: 10.1016/j.jbi.2025.104951
Shuyang Xie, Hailing Cai, Yaoqin Sun, Xudong Lv

Objective

To develop and evaluate LLM-DQR, an automated approach using large language models to generate electronic health record data quality rules, addressing the limitations of current manual and automated methods that suffer from low efficiency, limited flexibility, and inadequate coverage of complex business logic.

Materials and Methods

We designed a comprehensive pipeline with three core components: (1) standardized input processing integrating database schemas, natural language requirements, and sample data; (2) Chain-of-Thought prompt engineering for guided rule generation; and (3) closed-loop validation with deduplication, sandbox execution, and iterative debugging. The approach was evaluated on two distinct, publicly available datasets: the Paediatric Intensive Care (PIC) dataset and the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. Performance was compared against manual expert construction (expert-DQR) and clinical information model-based generation (CIM-DQR).
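
The closed-loop validation step (deduplication, sandbox execution, debugging feedback) can be sketched as follows. This is an illustration only, not the LLM-DQR code: no language model is called, the rules are hand-written stand-ins for generated output, the rule format (a SQL query returning violating rows) is an assumption, and the schema and sample rows are hypothetical rather than taken from PIC or MIMIC-IV.

```python
import sqlite3

def normalize(sql):
    """Collapse whitespace and case so trivially different rules compare equal."""
    return " ".join(sql.lower().split()).rstrip(";")

def deduplicate(rules):
    """Keep the first occurrence of each rule, compared after normalization."""
    seen, unique = set(), []
    for rule in rules:
        key = normalize(rule)
        if key not in seen:
            seen.add(key)
            unique.append(rule)
    return unique

def run_in_sandbox(rules, setup_sql):
    """Run each rule against a throwaway in-memory database of sample data.

    Returns ({rule: violation_count}, {rule: error_message}); rules in the
    second dict failed to execute and would go back for another debugging round.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    passed, failed = {}, {}
    for rule in rules:
        try:
            passed[rule] = len(conn.execute(rule).fetchall())
        except sqlite3.Error as exc:
            failed[rule] = str(exc)
    conn.close()
    return passed, failed

# Hypothetical schema and sample rows, for illustration only.
SETUP = """
CREATE TABLE admissions (id INTEGER, admit_time TEXT, discharge_time TEXT);
INSERT INTO admissions VALUES (1, '2020-01-01', '2020-01-05');
INSERT INTO admissions VALUES (2, '2020-02-10', '2020-02-01');
"""

# Hand-written stand-ins for generated rules; each selects violating rows.
candidate_rules = [
    "SELECT id FROM admissions WHERE discharge_time < admit_time",
    "select id from admissions  where discharge_time < admit_time;",  # duplicate
    "SELECT id FROM admissions WHERE lenght_of_stay < 0",             # bad column
]

unique_rules = deduplicate(candidate_rules)
passed, failed = run_in_sandbox(unique_rules, SETUP)
```

In a full loop, the error messages collected in `failed` would be fed back into the prompt so the model can repair the rule, which is the role the abstract assigns to iterative debugging.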

Results

LLM-DQR demonstrated higher performance across all evaluation metrics. The GPT implementation achieved overall coverage rates of 97.1% on the PIC dataset and 99.6% on the MIMIC-IV dataset, outperforming CIM-DQR. Performance was particularly strong for complex dimensions: LLM-DQR achieved 100% coverage for Consistency rules on both datasets, whereas CIM-DQR achieved 0%. Construction time was reduced by over 10-fold compared to manual methods. Additionally, on the PIC dataset, LLM-DQR generated 89 extra, expert-validated rules.

Discussion

The stronger performance demonstrates LLMs’ capability to understand complex EHR data patterns and assessment requirements, functioning as data quality analysis assistants with domain knowledge and logical reasoning capabilities.

Conclusion

LLM-DQR provides an efficient, scalable solution for automated data quality rule generation in clinical settings, offering considerable improvements over traditional approaches.
Journal of Biomedical Informatics, Volume 172, Article 104951 (Journal Article), published 2025-12-01.
Citations: 0
Predicting drug-target interactions based on multivariate information fusion and graph contrast learning
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-18 | DOI: 10.1016/j.jbi.2025.104960
Siying Yang, Ping-an He, Pan Zeng, Yajie Meng, Zilong Zhang, Feifei Cui, Yuhua Yao, Jialiang Yang, Junlin Xu
Drug-target interaction (DTI) prediction is of great significance in stimulating innovation and research in the medical field. Traditional experimental methods for identifying DTIs have proven to be time-consuming and costly. As a result, machine learning methods have been extensively applied to improve the prediction of drug-target interactions. However, the sparsity of inter-node connections often results in insufficiently learned node representations. Furthermore, many methods do not take into account the topological similarity between nodes when integrating similarities. This study proposes MGCLDTI, a model that integrates multiple sources of information and uses Graph Contrastive Learning (GCL) to predict potential drug-target interactions. First, MGCLDTI employs the DeepWalk algorithm to extract global topological representations from a heterogeneous graph that incorporates multi-view information on drugs, targets, and diseases. Next, a densification strategy is applied to alleviate the noise arising from the sparsity of the DTI matrix. A GCL model with node masking is then applied to enhance local structural awareness and optimize the embeddings of drugs and targets. Finally, DTI scores are predicted using the LightGBM algorithm. Comparative results against state-of-the-art methods demonstrate that MGCLDTI achieves superior predictive performance. In addition, ablation studies reveal the effectiveness of each component, and case studies provide compelling evidence of MGCLDTI’s accuracy in identifying potential DTIs.
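
The first stage of DeepWalk is a corpus of uniform random walks over the graph, which is then fed to a skip-gram model to learn node embeddings. The sketch below covers only the walk generation; the toy heterogeneous graph, the node names, and the walk parameters are hypothetical, and the skip-gram training, densification, contrastive learning, and LightGBM stages of MGCLDTI are not shown.

```python
import random

def random_walk(graph, start, walk_length, rng):
    """Uniform random walk of at most `walk_length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(graph, walks_per_node, walk_length, seed=0):
    """DeepWalk-style corpus: several walks from every node, in shuffled order."""
    rng = random.Random(seed)
    nodes = list(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks

# Toy undirected heterogeneous graph (hypothetical): drugs, targets, a disease.
graph = {
    "drug:A": ["target:X", "disease:D"],
    "drug:B": ["target:X", "target:Y"],
    "target:X": ["drug:A", "drug:B"],
    "target:Y": ["drug:B", "disease:D"],
    "disease:D": ["drug:A", "target:Y"],
}

walks = generate_walks(graph, walks_per_node=3, walk_length=5)
```

Each walk is treated like a sentence of node "words", so nodes that co-occur in many walks end up with similar embeddings, which is how global topology enters the representation.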
Journal of Biomedical Informatics, Volume 172, Article 104960 (Journal Article), published 2025-12-01.
Citations: 0
Multimodal large language models and mechanistic modeling for glucose forecasting in type 1 diabetes patients
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.jbi.2025.104945
J.C. Wolber, M. E. Samadi, J. Sellin, A. Schuppert

Introduction:

Management of type 1 diabetes remains a significant challenge because blood glucose levels can fluctuate dramatically and vary widely between individuals. We introduce an approach that combines multimodal large language models (mLLMs), mechanistic modeling of individual glucose metabolism, and machine learning (ML) to forecast blood glucose levels.

Methods:

This study uses the D1NAMO dataset (6 patients with meal images) to demonstrate mLLM integration for glucose prediction. An mLLM (Pixtral Large) was employed to estimate macronutrients from meal images, providing automated meal analysis without manual food logging. We compare three distinct approaches: (1) Baseline using only glucose dynamics and basic insulin features, (2) LastMeal providing additional information about the last meal ingested by the patient, and (3) Bézier incorporating mechanistically modeled temporal features using optimized cubic Bézier curves to model temporal impacts of individual macronutrients on blood glucose. The modeled feature impacts served as input features for a LightGBM model. We also validate the mechanistic modeling component on the AZT1D dataset (24 patients with structured carbohydrate and correction insulin logs).
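
To illustrate how a cubic Bézier curve can describe the time course of a macronutrient's effect, here is a minimal sketch. The control points, the 45 g carbohydrate amount, the helper names `cubic_bezier` and `impact_at`, and the way the curve value is turned into a feature are all assumptions for illustration; the abstract does not describe how the authors parameterize or optimize their curves.

```python
import numpy as np

def cubic_bezier(control_points, num=200):
    """Sample B(s) = sum_k C(3,k) (1-s)^(3-k) s^k P_k for s in [0, 1]."""
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    s = np.linspace(0.0, 1.0, num)[:, None]
    return (
        (1 - s) ** 3 * p0
        + 3 * (1 - s) ** 2 * s * p1
        + 3 * (1 - s) * s**2 * p2
        + s**3 * p3
    )

def impact_at(hours_since_meal, control_points):
    """Look up the curve's impact at a given time by interpolating along it.

    Assumes the time coordinates of the control points are increasing, so
    that time increases monotonically along the sampled curve.
    """
    curve = cubic_bezier(control_points)
    return float(np.interp(hours_since_meal, curve[:, 0], curve[:, 1],
                           left=0.0, right=0.0))

# Hypothetical control points: (hours after the meal, relative impact).
carb_curve = [(0.0, 0.0), (0.5, 1.5), (1.5, 1.0), (4.0, 0.0)]

# One engineered feature: grams of carbohydrate scaled by the curve's value.
feature = 45.0 * impact_at(1.0, carb_curve)
```

Features of this kind, one per macronutrient and time step, could then be passed to a gradient-boosted model such as LightGBM, with the control points tuned per patient.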

Results:

The Bézier approach achieved the best performance across both datasets: D1NAMO RMSE of 15.06 at 30 min and 28.15 at 60 min; AZT1D RMSE of 16.61 at 30 min and 24.58 at 60 min. One-way ANOVA revealed statistically significant differences across prediction horizons of 45 to 120 min for the AZT1D dataset. Patient-specific Bézier curves revealed distinct metabolic response patterns: simple sugars peaked at 0.74 h, complex sugars at 3.07 h, and proteins at 4.36 h post-ingestion. Feature importance analysis showed temporal evolution from glucose change dominance to macronutrient prominence at longer horizons. Patient-specific modeling uncovered individual metabolic signatures with varying nutritional sensitivity and circadian influences.

Conclusion:

This study demonstrates the potential of combining mLLMs with mechanistic modeling for personalized diabetes management. The optimized Bézier curve approach provides superior temporal mapping while patient-specific models reveal individual metabolic signatures essential for personalized care.
Journal of Biomedical Informatics, Volume 172, Article 104945 (Journal Article), published 2025-12-01.
Citations: 0
Caption-augmented reasoning model with Hierarchical rank LoRA finetuing for medical visual question Answering
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.jbi.2025.104964
Yong Li, Jianping Man, Yi Zhou, Likeng Liang

Objective

Medical Visual Question Answering (VQA) is a quintessential application scenario of biomedical Multimodal Large Language Models (MLLMs). Previous studies mainly focused on input image-question pairs, neglecting the rich medical knowledge in the captions of the pretraining datasets. This limits the model’s reasoning capability and causes overfitting. This paper aims to use those captions effectively to address both issues.

Methods

This paper proposes a Caption-Augmented Reasoning Model (CARM), which introduces three components to leverage the captions during finetuning: (1) a Cross-Modal Visual Augmentation (CMVA) module that enriches image feature representations through semantic alignment with retrieved captions; (2) a Retrieval Cross-Modal Attention (RCMA) mechanism that establishes explicit connections between visual features and domain-specific medical knowledge; and (3) a Hierarchical Rank Low-Rank Adaptation (HR-LoRA) module that optimizes parameter-efficient finetuning through rank-adaptive decomposition in both unimodal encoders and multimodal fusion layers.
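
As background for the HR-LoRA component, the sketch below shows plain low-rank adaptation with a different rank per block. It is not the authors' HR-LoRA: the abstract does not say how ranks are chosen, so the `ranks` mapping (smaller in the unimodal encoders, larger in fusion), the layer size, the `alpha` value, and the class name `LoRALinear` are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen
        self.A = rng.standard_normal((rank, in_dim)) * 0.01    # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

# Hypothetical rank assignment per block; chosen only to show varying ranks.
ranks = {"image_encoder": 4, "text_encoder": 4, "fusion": 16}
rng = np.random.default_rng(1)
layers = {
    name: LoRALinear(rng.standard_normal((64, 64)), rank)
    for name, rank in ranks.items()
}
x = rng.standard_normal((2, 64))
```

Because `B` starts at zero, each adapted layer initially reproduces the frozen layer exactly, and a rank-4 adapter on a 64×64 weight trains 512 parameters instead of 4,096, which is the source of the parameter efficiency.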

Results

The proposed CARM achieved state-of-the-art performance across three benchmark datasets, with accuracy scores of 0.798 on VQA-RAD, 0.867 on VQA-SLAKE, and 0.718 on VQA-Med-2019, respectively, outperforming existing medical VQA models. Qualitative evaluations revealed that our caption-based augmentation effectively directed model attention to the image regions related to a question.

Conclusions

The proposed CARM effectively improves visual grounding and reasoning accuracy with the systematic integration of medical captions, and the HR-LoRA alleviates overfitting and improves training efficiency.
Journal of Biomedical Informatics, Volume 172, Article 104964 (Journal Article), published 2025-12-01.
Citations: 0
Multi-objective optimization formulation for Alzheimer’s disease trial patient selection
IF 4.5 | Zone 2 (Medicine) | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.jbi.2025.104955
Alireza Moayedikia, Sara Fin, Uffe Kock Wiil

Objective:

Clinical trial recruitment faces critical challenges with screen failure rates exceeding 80% in Alzheimer’s disease (AD) trials. Traditional patient selection relies on expert consensus without systematic evaluation of trade-offs between statistical power, recruitment feasibility, safety, and cost. We developed a multi-objective optimization framework to systematically identify optimal eligibility criteria configurations that balance competing objectives in AD clinical trial design.

Methods:

We implemented the Non-dominated Sorting Genetic Algorithm III (NSGA-III) to optimize patient selection criteria across three objectives: patient identification accuracy (F1 score), recruitment balance, and economic efficiency. The framework utilized National Alzheimer’s Coordinating Center data comprising 2,743 participants with comprehensive clinical assessments and cerebrospinal fluid biomarker measurements. We optimized 14 eligibility parameters including age boundaries, cognitive thresholds, biomarker criteria, and comorbidity management policies. Statistical validation employed Monte Carlo simulation with 10,000 iterations, bootstrap analysis, and SHAP interpretability analysis.
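
The core idea behind non-dominated sorting, on which NSGA-III builds, is Pareto dominance. The sketch below filters a set of candidate criteria configurations down to its Pareto front; it omits the reference-point niching and genetic operators that NSGA-III adds, treats all three objectives as maximized, and uses made-up objective values.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere.

    All objectives are treated as maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (F1 score, recruitment balance, economic efficiency) triples.
candidates = [
    (0.98, 0.60, 0.70),
    (0.99, 0.55, 0.65),
    (0.97, 0.58, 0.68),   # dominated by the first candidate
    (0.95, 0.80, 0.40),
]
front = pareto_front(candidates)
```

Every configuration left in `front` represents a different trade-off, so choosing among them remains a judgment for the trial designers rather than something the algorithm decides.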

Results:

Optimization identified 11 Pareto-optimal solutions spanning F1 scores from 0.979 to 0.995 and eligible patient pools from 108 to 327. Compared to standard criteria selecting 101 participants, optimized approaches identified 102 participants with no significant demographic or clinical differences after multiple comparison correction. Monte Carlo simulation revealed mean cost savings of $1,048 per patient (95% CI: -$1,251 to $3,492), with 80.7% probability of positive savings but 19.3% risk of cost increases (SD = $1,208). Cross-validation demonstrated high precision (95.1%) with strategic selectivity (9.4% recall). SHAP analysis identified biomarker requirements as the dominant cost driver. Optimization algorithms converged toward solutions similar to expert-designed criteria, validating both computational and clinical approaches.
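
As a rough consistency check on the cost figures above, one can ask what a normal distribution with the reported mean and standard deviation would imply. The normality assumption is ours, not the authors'; the study's Monte Carlo output need not be normal, so this is only a back-of-the-envelope comparison.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mean_saving = 1048.0   # reported mean saving per patient (USD)
sd_saving = 1208.0     # reported standard deviation (USD)

# P(saving > 0) if savings were exactly normal with the reported mean and SD.
p_positive = 1.0 - normal_cdf((0.0 - mean_saving) / sd_saving)

# Central 95% interval under the same normal assumption.
lower = mean_saving - 1.96 * sd_saving
upper = mean_saving + 1.96 * sd_saving
```

Under this assumption `p_positive` is about 0.807, in line with the reported 80.7% probability of positive savings, while the normal interval (roughly -$1,320 to $3,416) is close to but not the same as the reported one, as would be expected if the published interval comes from simulation percentiles.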

Conclusion:

Multi-objective optimization provides meaningful but incremental value through systematic validation and probabilistic efficiency enhancement rather than revolutionary transformation. The convergence toward established practice demonstrates that computational approaches serve as sophisticated validation tools that identify concrete yet uncertain efficiency improvements within existing frameworks. The substantial variability in projected outcomes establishes realistic expectations and highlights the importance of site-specific evaluation, particularly regarding recruitment infrastructure quality as the dominant determinant of success. This establishes a mature paradigm for evidence-based trial design optimization that enhances rather than replaces clinical expertise.
Journal of Biomedical Informatics, Volume 172, Article 104955.
Citations: 0
Integrated analysis for electronic health records with structured and sporadic missingness
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-10-16 DOI: 10.1016/j.jbi.2025.104933
Jianbin Tan, Yan Zhang, Chuan Hong, T. Tony Cai, Tianxi Cai, Anru R. Zhang

Objectives:

We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.

Materials and Methods:

We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
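The two missingness patterns can be pictured with a small mask generator. This is a sketch under stated assumptions, not the Macomss method: "structured" is taken here to mean that a site never records certain variables, "sporadic" to mean isolated entries missing at random, and the function name and parameters are hypothetical.

```python
import random
from typing import Dict, List, Set


def make_missing_mask(
    rows_per_site: List[int],
    n_cols: int,
    unobserved_cols_by_site: Dict[int, Set[int]],
    sporadic_rate: float,
    seed: int = 0,
) -> List[List[bool]]:
    """Build a patient-by-variable mask; True marks a missing entry."""
    rng = random.Random(seed)
    mask: List[List[bool]] = []
    for site, n_rows in enumerate(rows_per_site):
        # Structured part: columns this site never collects.
        blocked = unobserved_cols_by_site.get(site, set())
        for _ in range(n_rows):
            # Sporadic part: any remaining entry may be missing at random.
            row = [(col in blocked) or (rng.random() < sporadic_rate)
                   for col in range(n_cols)]
            mask.append(row)
    return mask


# Two hypothetical sites: site 0 never records column 3, site 1 never records columns 0 and 1.
mask = make_missing_mask([3, 2], 4, {0: {3}, 1: {0, 1}}, sporadic_rate=0.2)
for row in mask:
    print("".join("x" if missing else "." for missing in row))
```

Stacking the sites produces whole missing blocks on top of scattered gaps, which is the situation an imputation method for integrated EHR data has to handle.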

Results:

Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.

Discussion:

The proposed method effectively addresses critical missingness patterns that arise in the integrated analysis of EHR datasets, enhancing the robustness and generalizability of clinical predictions.

Conclusions:

We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Journal of Biomedical Informatics, Volume 171, Article 104933.
Citations: 0
A non-interactive Online Medical Pre-Diagnosis system on encrypted vertically partitioned data
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-10-17 DOI: 10.1016/j.jbi.2025.104940
Min Tang, Yuhao Zhang, Ronghua Liang, Guoqiang Deng

Objective:

In medical environments, patient records are stored as heterogeneous features across various institutions, and legal or institutional constraints prohibit raw data sharing. This fragmentation presents challenges for Online Medical Pre-Diagnosis (OMPD) systems. Existing methods (such as federated learning) require multiple rounds of interaction among all participating parties (hospitals and cloud servers), resulting in frequent communication. Moreover, because they share global gradients, they are vulnerable to inference attacks that leak information. In this paper, we propose a secure and efficient OMPD system framework that addresses vertical data fragmentation and aims to resolve the tension between medical data isolation and model collaboration.

Methods:

We propose PPNLR, a secure framework for building OMPD systems. The framework combines functional encryption and blinding factors to design a sample-feature dimension encryption algorithm and a privacy-preserving vectorized training algorithm. Decoupling sample computation from model training enables cross-client data aggregation with only a single communication between hospitals and cloud servers.
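To make the vertical-partition setting concrete, the sketch below shows two hospitals that each hold different feature columns for the same patient and each send one blinded partial score to a server. It uses plain zero-sum masking over a modulus as a stand-in; it is not functional encryption and not the PPNLR algorithm, and all names and values are hypothetical.

```python
import random
from typing import List

MODULUS = 2**61 - 1  # arbitrary large modulus chosen for this sketch


def zero_sum_blinds(n_parties: int, rng: random.Random) -> List[int]:
    """Blinding values that cancel out (sum to 0 mod MODULUS) when combined."""
    blinds = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    blinds.append((-sum(blinds)) % MODULUS)
    return blinds


def blinded_partial(features: List[int], weights: List[int], blind: int) -> int:
    """One hospital's partial linear score over its own columns, hidden by its blind."""
    partial = sum(f * w for f, w in zip(features, weights))
    return (partial + blind) % MODULUS


rng = random.Random(42)
# Hypothetical integer-encoded features and weights held by each hospital.
hospital_a = {"features": [3, 5], "weights": [2, 1]}  # partial score 11
hospital_b = {"features": [4], "weights": [7]}        # partial score 28

blinds = zero_sum_blinds(2, rng)
# Each hospital sends exactly one message to the server.
messages = [
    blinded_partial(h["features"], h["weights"], b)
    for h, b in zip((hospital_a, hospital_b), blinds)
]
# The server sees only blinded values; the blinds cancel in the sum.
score = sum(messages) % MODULUS
print(score)
```

The point of the sketch is the communication pattern: one message per data holder, with individual contributions hidden and only the aggregate recoverable.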

Results:

Security analysis shows that PPNLR is resistant to semi-honest inference attacks and collusion attacks. Evaluation results based on six real-world medical datasets (text and images) show that: (i) The inference accuracy is close to that of the centralized plaintext training benchmark; (ii) The computational efficiency is at least 3.6× higher than that of comparable approaches; (iii) The communication complexity is significantly reduced by eliminating dependencies on iteration count.

Conclusion:

PPNLR achieves data protection through cryptographic primitives, maintaining high diagnostic accuracy while ensuring the security of medical data and model parameters. Its single-communication architecture significantly lowers the deployment threshold in resource-constrained scenarios, providing a practical framework for building privacy-friendly OMPD systems.
Journal of Biomedical Informatics, Volume 171, Article 104940.
Citations: 0
Advancing causal inference in medicine using biobank data
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-09-13 DOI: 10.1016/j.jbi.2025.104903
Hadasa Kaufman, Nadav Rappoport, Amir Gilad, Michal Linial
Causal inference from observational medical record data is critical for advancing precision and personalization in healthcare. Recently, biobanks (collections of biological samples linked with genetic, lifestyle, environmental, and health-related data) have emerged as valuable resources for large-scale population studies. By integrating these resources, biobanks offer a harmonized repository of diverse data for each individual, capturing real-world medical events, including procedures, treatments, and diagnoses. However, these resources are often affected by confounding factors, selection biases, and missing information, posing significant challenges to drawing valid causal conclusions. While randomized controlled trials (RCTs) remain the gold standard for drug development and medical decision-making, the growing availability of observational data highlights the need for robust causal inference methodologies. This study provides an overview of methods, applicable to biobank data, for inferring the effect of a treatment on an outcome from observational data, focusing on the unique challenges they address. Our objective is to introduce current methods used for causal discovery in observational medical data. We discuss classic and modern methodologies, the opportunities they offer, and the difficulty of establishing causality. We cover statistical methods designed for large-scale biobanks that have the potential to improve clinical decision-making, guide public health policies, and drive further research.
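The effect of confounding mentioned above can be illustrated with a small stratified comparison. This is a generic textbook-style sketch (standardization over one binary confounder), not a method taken from the review, and the counts are invented for illustration.

```python
from typing import Dict, Tuple

# Per stratum of a hypothetical confounder (age group):
# (treated_n, treated_events, control_n, control_events)
Stratum = Tuple[int, int, int, int]
strata: Dict[str, Stratum] = {
    "younger": (20, 2, 80, 16),
    "older": (80, 32, 20, 10),
}


def naive_risk_difference(data: Dict[str, Stratum]) -> float:
    """Risk difference that ignores the confounder by pooling all patients."""
    treated_n = sum(s[0] for s in data.values())
    treated_events = sum(s[1] for s in data.values())
    control_n = sum(s[2] for s in data.values())
    control_events = sum(s[3] for s in data.values())
    return treated_events / treated_n - control_events / control_n


def adjusted_risk_difference(data: Dict[str, Stratum]) -> float:
    """Stratum-specific risk differences averaged with weights equal to stratum size."""
    total = sum(t_n + c_n for t_n, _, c_n, _ in data.values())
    diff = 0.0
    for t_n, t_ev, c_n, c_ev in data.values():
        weight = (t_n + c_n) / total
        diff += weight * (t_ev / t_n - c_ev / c_n)
    return diff


naive = naive_risk_difference(strata)
adjusted = adjusted_risk_difference(strata)
print(f"naive: {naive:+.2f}, adjusted: {adjusted:+.2f}")
```

In this invented example the pooled comparison suggests the treatment raises risk, while within each age group it lowers risk, because older patients are both more likely to be treated and more likely to have the outcome.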
Journal of Biomedical Informatics, Volume 171, Article 104903.
Citations: 0
Accelerating probabilistic privacy-preserving medical record linkage: A three-party MPC approach
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-10-01 DOI: 10.1016/j.jbi.2025.104920
Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün

Objective:

Record linkage is essential for integrating data from multiple sources, with diverse applications in real-world healthcare and research. Probabilistic Privacy-Preserving Record Linkage (PPRL) enables this integration while protecting sensitive information from unauthorized access, especially when datasets lack exact identifiers. As privacy regulations evolve and multi-institutional collaborations expand globally, there is a growing demand for methods that effectively balance security, accuracy, and efficiency. However, ensuring both privacy and scalability in large-scale record linkage remains a key challenge.

Method:

This paper presents a novel and efficient PPRL method based on a secure three-party computation framework (multi-party computation, MPC). Our approach allows multiple parties to compute linkage results without exposing their private inputs and significantly improves the speed of the linkage process compared to existing PPRL solutions.
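The basic building block of many three-party schemes, additive secret sharing, can be sketched as follows. This is a generic illustration of how parties compute on values none of them sees in the clear; it is not the protocol proposed in the paper, and the modulus and function names are assumptions of this sketch.

```python
import secrets
from typing import List

MODULUS = 2**32  # ring size chosen for this sketch


def share(value: int, n_parties: int = 3) -> List[int]:
    """Split value into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recover the secret; requires all shares."""
    return sum(shares) % MODULUS


def add_shared(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


a_shares = share(17)
b_shares = share(25)
sum_shares = add_shared(a_shares, b_shares)
total = reconstruct(sum_shares)
print(total)
```

Any single share is uniformly random and reveals nothing about the input on its own. Additions are local, whereas multiplications and comparisons need interaction between the parties, which is why network latency and bandwidth dominate the running times of such protocols.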

Result:

Our method preserves the linkage quality of a state-of-the-art (SOTA) MPC-based PPRL method while running up to 14 times faster. For example, linking a record against a database of 10,000 records takes just 8.74 s in a realistic network with 700 Mbps bandwidth and 60 ms latency, compared to 92.32 s with the SOTA method. Even on a slower internet connection with 100 Mbps bandwidth and 60 ms latency, the linkage completes in 28 s, whereas the SOTA method requires 287.96 s. These results demonstrate the significant scalability and efficiency improvements of our approach.
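The speedups implied by the two quoted timing pairs can be worked out directly. The snippet below only divides the numbers given above; the helper name is ours.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


# Timings quoted in the Result paragraph (one record against 10,000 records).
fast_network = speedup(92.32, 8.74)   # 700 Mbps, 60 ms latency
slow_network = speedup(287.96, 28.0)  # 100 Mbps, 60 ms latency

print(f"700 Mbps: {fast_network:.1f}x, 100 Mbps: {slow_network:.1f}x")
```

Both quoted settings work out to roughly a tenfold speedup, so the "up to 14 times" figure presumably refers to a configuration not given in the abstract.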

Conclusion:

Our novel PPRL method, based on secure 3-party computation, offers an efficient and scalable solution for large-scale record linkage while ensuring privacy protection. The approach demonstrates significant performance improvements, making it a promising tool for secure data integration in privacy-sensitive sectors.
Journal of Biomedical Informatics, Volume 171, Article 104920.
Citations: 0