
Journal of Biomedical Informatics: Latest Articles

Predicting drug-target interactions based on multivariate information fusion and graph contrast learning
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-18 | DOI: 10.1016/j.jbi.2025.104960
Siying Yang , Ping-an He , Pan Zeng , Yajie Meng , Zilong Zhang , Feifei Cui , Yuhua Yao , Jialiang Yang , Junlin Xu
Drug-target interaction (DTI) prediction is of great significance in stimulating innovation and research in the medical field. In recent years, traditional experimental methods for identifying DTIs have proven to be time-consuming and costly. As a result, machine learning methods have been extensively applied to improve the prediction of drug-target interactions. However, the sparsity of inter-node connections often results in insufficiently learned node representations. Furthermore, many methods do not take into account the topological similarity between nodes when integrating similarities. This study proposes MGCLDTI, a model that integrates multiple sources of information and uses Graph Contrastive Learning (GCL) to predict potential drug-target interactions. First, MGCLDTI employs the DeepWalk algorithm to extract global topological representations from a heterogeneous graph that incorporates multi-view information on drugs, targets, and diseases. Subsequently, a densification strategy is applied to alleviate the noise arising from the sparsity of the DTI matrix. Furthermore, a GCL model with node masking is applied to enhance local structural awareness and optimize the embeddings of drugs and targets. Finally, DTI scores are predicted using the LightGBM algorithm. Comparative results against state-of-the-art methods demonstrate that MGCLDTI achieves superior predictive performance. In addition, ablation studies confirm the effectiveness of each component. Case studies also provide compelling evidence of MGCLDTI’s accuracy in identifying potential DTIs.
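The DeepWalk step mentioned in the abstract starts by sampling random walks over the graph; those walks are then usually passed to a skip-gram model to learn node embeddings. The sketch below covers only the walk-sampling part, under stated assumptions: the toy graph, the node names (d* for drugs, t* for targets, s* for diseases), and the function name `random_walks` are hypothetical and are not taken from the paper.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks, as in the first stage of DeepWalk."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy heterogeneous graph (undirected, stored as adjacency lists).
toy_graph = {
    "d1": ["t1", "s1"],
    "d2": ["t1", "t2"],
    "t1": ["d1", "d2"],
    "t2": ["d2", "s1"],
    "s1": ["d1", "t2"],
}

walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

Each walk acts like a "sentence" of node identifiers; nodes that co-occur in many walks end up with similar embeddings once an embedding model is trained on them.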
Journal of Biomedical Informatics, Volume 172, Article 104960.
Citations: 0
Multimodal large language models and mechanistic modeling for glucose forecasting in type 1 diabetes patients
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.jbi.2025.104945
J.C. Wolber , M. E. Samadi , J. Sellin , A. Schuppert

Introduction:

Management of type 1 diabetes remains a significant challenge because blood glucose levels can fluctuate dramatically and vary greatly between individuals. We introduce an approach that combines multimodal large language models (mLLMs), mechanistic modeling of individual glucose metabolism, and machine learning (ML) for forecasting blood glucose levels.

Methods:

This study uses the D1NAMO dataset (6 patients with meal images) to demonstrate mLLM integration for glucose prediction. An mLLM (Pixtral Large) was employed to estimate macronutrients from meal images, providing automated meal analysis without manual food logging. We compare three distinct approaches: (1) Baseline using only glucose dynamics and basic insulin features, (2) LastMeal providing additional information about the last meal ingested by the patient, and (3) Bézier incorporating mechanistically modeled temporal features using optimized cubic Bézier curves to model temporal impacts of individual macronutrients on blood glucose. The modeled feature impacts served as input features for a LightGBM model. We also validate the mechanistic modeling component on the AZT1D dataset (24 patients with structured carbohydrate and correction insulin logs).
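To make the Bézier idea concrete, the sketch below evaluates a cubic Bézier curve whose horizontal axis is hours since a meal and whose vertical axis is a relative impact on blood glucose. The control points are hypothetical placeholders, not fitted values from the study, and the function name `cubic_bezier` is an illustrative choice; in the paper the curve shape is optimized, which this sketch does not attempt.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    Each control point is an (x, y) pair; the result is the (x, y) point
    B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3.
    """
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return x, y


# Hypothetical control points: (hours since meal, relative impact).
control = [(0.0, 0.0), (0.5, 1.0), (1.5, 1.0), (4.0, 0.0)]

# Sample the curve and locate the time of maximum impact.
samples = [cubic_bezier(*control, t=i / 200) for i in range(201)]
peak_time, peak_impact = max(samples, key=lambda point: point[1])
```

With these placeholder control points the impact starts at zero, rises to a single peak, and returns to zero four hours after the meal. Sampled values of such a curve at the prediction time could then serve as input features for a gradient-boosting model.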

Results:

The Bézier approach achieved the best performance across both datasets: D1NAMO RMSE of 15.06 at 30 min and 28.15 at 60 min; AZT1D RMSE of 16.61 at 30 min and 24.58 at 60 min. One-way ANOVA revealed statistically significant differences across prediction horizons of 45 to 120 min for the AZT1D dataset. Patient-specific Bézier curves revealed distinct metabolic response patterns: simple sugars peaked at 0.74 h, complex sugars at 3.07 h, and proteins at 4.36 h post-ingestion. Feature importance analysis showed temporal evolution from glucose change dominance to macronutrient prominence at longer horizons. Patient-specific modeling uncovered individual metabolic signatures with varying nutritional sensitivity and circadian influences.

Conclusion:

This study demonstrates the potential of combining mLLMs with mechanistic modeling for personalized diabetes management. The optimized Bézier curve approach provides superior temporal mapping while patient-specific models reveal individual metabolic signatures essential for personalized care.
Journal of Biomedical Informatics, Volume 172, Article 104945.
Citations: 0
Caption-augmented reasoning model with Hierarchical rank LoRA finetuing for medical visual question Answering
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.jbi.2025.104964
Yong Li , Jianping Man , Yi Zhou , Likeng Liang

Objective

Medical Visual Question Answering (VQA) is a quintessential application scenario of biomedical Multimodal Large Language Models (MLLMs). Previous studies mainly focused on input image-question pairs, neglecting the rich medical knowledge of the relevant captions of the pretrained datasets. This limits the model’s reasoning capability and causes overfitting. This paper aims to effectively utilize the captions of pretrained datasets to solve the above issues.

Methods

This paper proposes a Caption-Augmented Reasoning Model (CARM), which introduces three components to leverage the captions during finetuning: (1) a Cross-Modal Visual Augmentation (CMVA) module that enriches image feature representations through semantic alignment with retrieved captions; (2) a Retrieval Cross-Modal Attention (RCMA) mechanism that establishes explicit connections between visual features and domain-specific medical knowledge; (3) a Hierarchical Rank Low-Rank Adaptation (HR-LoRA) module that optimizes parameter-efficient finetuning through rank-adaptive decomposition in both unimodal encoders and multimodal fusion layers.
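For readers unfamiliar with low-rank adaptation, the sketch below shows the standard LoRA update, in which a frozen weight matrix W is adjusted by a scaled product of two small matrices, W' = W + (alpha / r) * B * A. It is a generic illustration in plain Python, not the HR-LoRA module itself; the per-layer ranks used in the parameter count (4 and 16) and the 768-wide layer are hypothetical, and the abstract does not say which layers receive which rank.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Tiny worked example with rank 1.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
b = [[1.0], [2.0]]             # 2x1
a = [[3.0, 4.0]]               # 1x2
adapted = lora_weight(w, b, a, alpha=2.0)

# Hypothetical ranks for two layer groups of a 768x768 projection.
low_rank_cost = trainable_params(768, 768, rank=4)
high_rank_cost = trainable_params(768, 768, rank=16)
full_cost = 768 * 768
```

The parameter counts show why the rank is worth tuning per layer group: a rank-4 adapter trains about 1% of the weights of the full 768x768 matrix, and a rank-16 adapter about 4%.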

Results

The proposed CARM achieved state-of-the-art performance across three benchmark datasets, with accuracy scores of 0.798 on VQA-RAD, 0.867 on VQA-SLAKE, and 0.718 on VQA-Med-2019, respectively, outperforming existing medical VQA models. Qualitative evaluations revealed that our caption-based augmentation effectively directed model attention to the image regions related to a question.

Conclusions

The proposed CARM effectively improves visual grounding and reasoning accuracy with the systematic integration of medical captions, and the HR-LoRA alleviates overfitting and improves training efficiency.
Journal of Biomedical Informatics, Volume 172, Article 104964.
Citations: 0
Multi-objective optimization formulation for Alzheimer’s disease trial patient selection
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-12-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.jbi.2025.104955
Alireza Moayedikia , Sara Fin , Uffe Kock Wiil

Objective:

Clinical trial recruitment faces critical challenges with screen failure rates exceeding 80% in Alzheimer’s disease (AD) trials. Traditional patient selection relies on expert consensus without systematic evaluation of trade-offs between statistical power, recruitment feasibility, safety, and cost. We developed a multi-objective optimization framework to systematically identify optimal eligibility criteria configurations that balance competing objectives in AD clinical trial design.

Methods:

We implemented the Non-dominated Sorting Genetic Algorithm III (NSGA-III) to optimize patient selection criteria across three objectives: patient identification accuracy (F1 score), recruitment balance, and economic efficiency. The framework utilized National Alzheimer’s Coordinating Center data comprising 2,743 participants with comprehensive clinical assessments and cerebrospinal fluid biomarker measurements. We optimized 14 eligibility parameters including age boundaries, cognitive thresholds, biomarker criteria, and comorbidity management policies. Statistical validation employed Monte Carlo simulation with 10,000 iterations, bootstrap analysis, and SHAP interpretability analysis.
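The core notion behind NSGA-III and the Pareto-optimal solutions it returns is dominance: one configuration dominates another if it is at least as good on every objective and strictly better on at least one. The sketch below filters a set of candidates down to its non-dominated front. It shows only this dominance step (NSGA-III additionally uses reference points to keep the front diverse), and the candidate scores are hypothetical, with all three objectives treated as "higher is better".

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (F1 score, recruitment balance, economic efficiency) triples.
candidates = [
    (0.990, 0.5, 0.4),
    (0.980, 0.7, 0.6),
    (0.970, 0.6, 0.5),  # worse than the second candidate on all three objectives
    (0.995, 0.4, 0.7),
]

front = pareto_front(candidates)
```

Here the third candidate is dropped and the other three remain, since each of them wins on at least one objective against every rival. Choosing among the survivors is a judgment about trade-offs rather than a computation.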

Results:

Optimization identified 11 Pareto-optimal solutions spanning F1 scores from 0.979 to 0.995 and eligible patient pools from 108 to 327. Compared to standard criteria selecting 101 participants, optimized approaches identified 102 participants with no significant demographic or clinical differences after multiple comparison correction. Monte Carlo simulation revealed mean cost savings of $1,048 per patient (95% CI: -$1,251 to $3,492), with 80.7% probability of positive savings but 19.3% risk of cost increases (SD = $1,208). Cross-validation demonstrated high precision (95.1%) with strategic selectivity (9.4% recall). SHAP analysis identified biomarker requirements as the dominant cost driver. Optimization algorithms converged toward solutions similar to expert-designed criteria, validating both computational and clinical approaches.
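The reported mean and standard deviation of per-patient savings can be related to the reported probability of positive savings with a short calculation. The sketch below assumes, purely for illustration, that simulated savings are approximately normally distributed; the abstract does not state the distribution, and the variable names are hypothetical. It computes P(savings > 0) analytically and then checks it with a 10,000-draw simulation.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mean_savings = 1048.0  # reported mean savings per patient, in dollars
sd_savings = 1208.0    # reported standard deviation, in dollars

# Under a normal assumption, P(savings > 0) = Phi(mean / sd), about 0.807.
p_positive = normal_cdf(mean_savings / sd_savings)

# Monte Carlo check of the same quantity with 10,000 draws.
rng = random.Random(0)
draws = [rng.gauss(mean_savings, sd_savings) for _ in range(10_000)]
simulated_fraction = sum(d > 0 for d in draws) / len(draws)
```

The analytic value agrees with the 80.7% reported in the abstract. A normal interval of mean ± 1.96 SD would run from roughly -$1,320 to $3,416, which differs somewhat from the reported interval, so the simulated distribution is presumably not exactly normal.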

Conclusion:

Multi-objective optimization provides meaningful but incremental value through systematic validation and probabilistic efficiency enhancement rather than revolutionary transformation. The convergence toward established practice demonstrates that computational approaches serve as sophisticated validation tools that identify concrete yet uncertain efficiency improvements within existing frameworks. The substantial variability in projected outcomes establishes realistic expectations and highlights the importance of site-specific evaluation, particularly regarding recruitment infrastructure quality as the dominant determinant of success. This establishes a mature paradigm for evidence-based trial design optimization that enhances rather than replaces clinical expertise.
Journal of Biomedical Informatics, Volume 172, Article 104955.
Citations: 0
Integrated analysis for electronic health records with structured and sporadic missingness
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-11-01 | Epub Date: 2025-10-16 | DOI: 10.1016/j.jbi.2025.104933
Jianbin Tan , Yan Zhang , Chuan Hong , T. Tony Cai , Tianxi Cai , Anru R. Zhang

Objectives:

We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.

Materials and Methods:

We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
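One way to picture the two kinds of missingness is as an observed/missing mask over a stacked patient-by-feature table. The sketch below is an interpretation, not the paper's definition: it assumes that structured missingness means whole feature columns are unrecorded at a given site, and that sporadic missingness means individual entries are lost at random. The function name `build_mask` and all parameters are hypothetical.

```python
import random


def build_mask(rows_per_site, n_features, unrecorded, sporadic_rate, seed=0):
    """Return a mask (True = observed) for sites stacked row-wise.

    unrecorded maps a site index to the set of feature columns that the
    site never records (structured gaps); every remaining entry is dropped
    independently with probability sporadic_rate (sporadic gaps).
    """
    rng = random.Random(seed)
    mask = []
    for site, n_rows in enumerate(rows_per_site):
        missing_cols = unrecorded.get(site, set())
        for _ in range(n_rows):
            row = [
                (j not in missing_cols) and (rng.random() >= sporadic_rate)
                for j in range(n_features)
            ]
            mask.append(row)
    return mask


# Two hypothetical sites: site 0 never records feature 3,
# site 1 never records features 0 and 1.
mask = build_mask(
    rows_per_site=[3, 2],
    n_features=4,
    unrecorded={0: {3}, 1: {0, 1}},
    sporadic_rate=0.1,
)
```

An imputation method for integrated data has to fill both kinds of gaps; the block-shaped gaps are the harder case because no patient at that site provides information about the unrecorded features.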

Results:

Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.

Discussion:

The proposed method effectively addresses critical missingness patterns that arise in the integrated analysis of EHR datasets, enhancing the robustness and generalizability of clinical predictions.

Conclusions:

We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Journal of Biomedical Informatics, Volume 171, Article 104933.
Citations: 0
A non-interactive Online Medical Pre-Diagnosis system on encrypted vertically partitioned data
IF 4.5 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-11-01 | Epub Date: 2025-10-17 | DOI: 10.1016/j.jbi.2025.104940
Min Tang , Yuhao Zhang , Ronghua Liang , Guoqiang Deng

Objective:

In medical environments, patient records are stored as heterogeneous features across various institutions, and legal or institutional constraints prohibit sharing the raw data. This fragmentation presents challenges for Online Medical Pre-Diagnosis (OMPD) systems. Existing methods (such as federated learning) require multiple rounds of interaction among all participating parties (hospitals and cloud servers), resulting in frequent communication. Moreover, because they share global gradients, they are vulnerable to inference attacks, leading to information leakage. In this paper, we propose a secure and efficient OMPD system framework to address the problem of vertical data fragmentation, aiming to resolve the contradiction between medical data isolation and model collaboration.

Methods:

We propose PPNLR, a secure framework for building OMPD systems. The framework combines functional encryption and blinding factors to design a sample-feature dimension encryption algorithm and a privacy-preserving vectorized training algorithm. Decoupling sample computation from model training enables cross-client data aggregation with only a single communication between hospitals and cloud servers.
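The role of blinding factors can be illustrated with a generic masking sketch. This is not the PPNLR algorithm, and it leaves out functional encryption entirely; the function names, the modulus, and the zero-sum mask construction are all assumptions made for illustration. Each hospital adds a random mask to its local value, the masks sum to zero modulo a public modulus, and the server therefore learns only the aggregate from a single upload per hospital.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative public modulus (assumption, not from the paper)


def zero_sum_masks(n_parties: int, modulus: int = MODULUS) -> list[int]:
    """Return n random masks whose sum is 0 modulo `modulus`."""
    masks = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    masks.append(-sum(masks) % modulus)  # last mask cancels all the others
    return masks


def blind(values: list[int], masks: list[int], modulus: int = MODULUS) -> list[int]:
    """Each party uploads its value plus its own mask, exactly once."""
    return [(v + m) % modulus for v, m in zip(values, masks)]


def aggregate(blinded: list[int], modulus: int = MODULUS) -> int:
    """The server sums the uploads; the masks cancel in the total."""
    return sum(blinded) % modulus


if __name__ == "__main__":
    local_values = [12, 30, 7]  # hypothetical per-hospital partial sums
    uploads = blind(local_values, zero_sum_masks(len(local_values)))
    print(aggregate(uploads))  # prints 49, the sum of local_values
```

How the masks are agreed on without extra communication rounds is exactly the part a real scheme has to solve; the sketch simply assumes they are available.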

Results:

Security analysis shows that PPNLR is resistant to semi-honest inference attacks and collusion attacks. Evaluation results based on six real-world medical datasets (text and images) show that: (i) The inference accuracy is close to that of the centralized plaintext training benchmark; (ii) The computational efficiency is at least 3.6× higher than that of comparable approaches; (iii) The communication complexity is significantly reduced by eliminating dependencies on iteration count.

Conclusion:

PPNLR achieves data protection through cryptographic primitives, maintaining high diagnostic accuracy while ensuring the security of medical data and model parameters. Its single-communication architecture significantly lowers the deployment threshold in resource-constrained scenarios, providing a practical framework for building privacy-friendly OMPD systems.
Journal of Biomedical Informatics, Volume 171, Article 104940
Citations: 0
Advancing causal inference in medicine using biobank data
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-09-13 DOI: 10.1016/j.jbi.2025.104903
Hadasa Kaufman , Nadav Rappoport , Amir Gilad , Michal Linial
Causal inference from observational medical record data is critical for advancing precision and personalization in healthcare. Recently, biobanks – collections of biological samples linked with genetic, lifestyle, environmental, and health-related data – have emerged as valuable resources for large-scale population studies. By integrating these resources, biobanks offer a harmonized repository of diverse data for each individual, capturing real-world medical events, including procedures, treatments, and diagnoses. However, these resources are often affected by confounding factors, selection biases, and missing information, posing significant challenges to drawing valid causal conclusions. While randomized controlled trials (RCTs) remain the gold standard for drug development and medical decision-making, the growing availability of observational data highlights the need for robust causal inference methodologies. This study provides an overview of methods, applicable to biobank data, for inferring the effect of a treatment on an outcome from observational data, focusing on the unique challenges each method addresses. Our objective is to introduce current methods used for causal discovery in observational medical data. We discuss classic and modern methodologies, the opportunities they offer, and the difficulty of establishing causality. We cover statistical methods designed for large-scale biobanks that have the potential to improve clinical decision-making, guide public health policies, and drive further research.
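Why confounding matters for treatment-effect estimates can be shown with a small worked example. The sketch below uses synthetic counts (not data from the article) in which disease severity influences both who is treated and who recovers; it compares a naive difference in recovery rates with one standardized over the severity strata. The table, the names, and the choice of standardization as the adjustment method are assumptions for illustration only.

```python
# (severity stratum, treated?) -> (number of patients, number recovered)
# Synthetic counts chosen so that severe cases are treated more often.
COUNTS = {
    ("mild", True): (100, 90),
    ("mild", False): (400, 320),
    ("severe", True): (400, 200),
    ("severe", False): (100, 40),
}


def naive_difference(counts):
    """Recovery rate of treated minus untreated, ignoring severity."""
    def rate(treated):
        n = sum(c[0] for (_, t), c in counts.items() if t == treated)
        r = sum(c[1] for (_, t), c in counts.items() if t == treated)
        return r / n
    return rate(True) - rate(False)


def standardized_difference(counts):
    """Average the within-stratum differences, weighted by stratum size."""
    strata = {s for s, _ in counts}
    total = sum(n for n, _ in counts.values())
    diff = 0.0
    for s in strata:
        n_t, r_t = counts[(s, True)]
        n_c, r_c = counts[(s, False)]
        diff += (n_t + n_c) / total * (r_t / n_t - r_c / n_c)
    return diff


if __name__ == "__main__":
    print(f"naive:        {naive_difference(COUNTS):+.2f}")         # -0.14
    print(f"standardized: {standardized_difference(COUNTS):+.2f}")  # +0.10
```

In this toy table the treatment looks harmful when severity is ignored and beneficial once it is accounted for. The adjustment is only valid here because the single confounder is observed, which is the kind of assumption that real biobank analyses have to justify.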
Journal of Biomedical Informatics, Volume 171, Article 104903
Citations: 0
Accelerating probabilistic privacy-preserving medical record linkage: A three-party MPC approach
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-10-01 DOI: 10.1016/j.jbi.2025.104920
Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün

Objective:

Record linkage is essential for integrating data from multiple sources, with diverse applications in real-world healthcare and research. Probabilistic Privacy-Preserving Record Linkage (PPRL) enables this integration while protecting sensitive information from unauthorized access, especially when datasets lack exact identifiers. As privacy regulations evolve and multi-institutional collaborations expand globally, there is a growing demand for methods that effectively balance security, accuracy, and efficiency. However, ensuring both privacy and scalability in large-scale record linkage remains a key challenge.

Method:

This paper presents a novel and efficient PPRL method based on a secure 3-party computation (MPC) framework. Our approach allows multiple parties to compute linkage results without exposing their private inputs and significantly improves the speed of the linkage process compared to existing PPRL solutions.
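The abstract does not describe the protocol itself, so the following is only a sketch of the textbook building block behind many three-party MPC schemes: additive secret sharing. The ring size, the function names, and the toy values are assumptions. A value is split into three random-looking shares, each party adds the shares it holds locally, and the result is revealed only when all three shares are combined.

```python
import secrets

RING = 2**64  # shares are integers modulo 2**64 (illustrative assumption)


def share(secret: int) -> tuple[int, int, int]:
    """Split `secret` into three additive shares that sum to it mod RING."""
    s1 = secrets.randbelow(RING)
    s2 = secrets.randbelow(RING)
    s3 = (secret - s1 - s2) % RING
    return s1, s2, s3


def reconstruct(shares: tuple[int, int, int]) -> int:
    """Recombine all three shares; any two of them alone are uniformly random."""
    return sum(shares) % RING


def add_shared(a: tuple[int, int, int], b: tuple[int, int, int]) -> tuple[int, ...]:
    """Party i adds its i-th shares locally; no communication is needed."""
    return tuple((x + y) % RING for x, y in zip(a, b))


if __name__ == "__main__":
    a = share(5)   # hypothetical private input of one data owner
    b = share(9)   # hypothetical private input of another data owner
    print(reconstruct(add_shared(a, b)))  # prints 14
```

Addition is free in this representation; comparisons and multiplications, which a linkage score would also need, require interaction between the parties and are where protocol design affects running time.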

Result:

Our method preserves the linkage quality of a state-of-the-art (SOTA) MPC-based PPRL method while running up to 14 times faster. For example, linking a record against a database of 10,000 records takes just 8.74 s in a realistic network with 700 Mbps bandwidth and 60 ms latency, compared to 92.32 s with the SOTA method. Even on a slower internet connection with 100 Mbps bandwidth and 60 ms latency, the linkage completes in 28 s, whereas the SOTA method requires 287.96 s. These results demonstrate the significant scalability and efficiency improvements of our approach.
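The speedups implied by the two quoted network settings can be recomputed directly from the reported timings. The snippet below is plain arithmetic on the figures given above; the dictionary layout and function name are illustrative, and the 28 s figure is used as reported.

```python
def speedup(baseline_s: float, ours_s: float) -> float:
    """Ratio of baseline running time to the new method's running time."""
    return baseline_s / ours_s


# network setting -> (SOTA seconds, proposed method seconds), as quoted in the abstract
TIMINGS = {
    "700 Mbps, 60 ms": (92.32, 8.74),
    "100 Mbps, 60 ms": (287.96, 28.0),
}

if __name__ == "__main__":
    for setting, (sota, ours) in TIMINGS.items():
        print(f"{setting}: {speedup(sota, ours):.1f}x")
    # 700 Mbps, 60 ms: 10.6x
    # 100 Mbps, 60 ms: 10.3x
```

Both quoted settings come out at roughly a tenfold speedup; the "up to 14 times" figure refers to a configuration that the abstract does not spell out.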

Conclusion:

Our novel PPRL method, based on secure 3-party computation, offers an efficient and scalable solution for large-scale record linkage while ensuring privacy protection. The approach demonstrates significant performance improvements, making it a promising tool for secure data integration in privacy-sensitive sectors.
Journal of Biomedical Informatics, Volume 171, Article 104920
Citations: 0
Definitions to data flow: Operationalizing MIABIS in HL7 FHIR
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-09-27 DOI: 10.1016/j.jbi.2025.104919
Radovan Tomášik , Šimon Koňár , Niina Eklund , Cäcilia Engels , Zdenka Dudova , Radoslava Kacová , Roman Hrstka , Petr Holub

Objective

Biobanks and biomolecular resources are increasingly central to data-driven biomedical research, encompassing not only metadata but also granular, sample-related data from diverse sources such as healthcare systems, national registries, and research outputs. However, the lack of a standardised, machine-readable format for representing such data limits interoperability, data reuse and integration into clinical and research environments. While MIABIS provides a conceptual model for biobank data, its abstract nature and reliance on heterogeneous implementations create barriers to practical, scalable adoption. This study presents a pragmatic, operational implementation of MIABIS focused on enabling real-world exchange and integration of sample-level data.

Methods

We systematically evaluated established data exchange standards, comparing HL7 FHIR and OMOP CDM with respect to their suitability for structuring sample-related data in a semantically robust and machine-readable form. Based on this analysis, we developed a FHIR-based representation of MIABIS that supports complex biobank structures and enables integration with federated data infrastructures. Supporting tools, including a Python library and an implementation guide, were created to ensure usability across diverse research and clinical contexts.

Results

We created nine interoperable FHIR profiles covering core MIABIS entities, ensuring consistency with FHIR standards. To support adoption, we developed an open-source Python library that abstracts FHIR interactions and provides schema validation for MIABIS-compliant data. The library was integrated into an ETL tool in operation at the Czech Node of BBMRI-ERIC (the European Biobanking and Biomolecular Resources Research Infrastructure) to demonstrate usability with real-world sample-related data. Separately, we validated the representation of MIABIS entities at the organisational level by converting the data structures of the BBMRI-ERIC Directory into FHIR, demonstrating compatibility with federated data infrastructures.
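To give a feel for what sample-level data looks like as a FHIR resource, here is a minimal sketch that builds a FHIR `Specimen` as a plain Python dictionary and runs a toy completeness check. It does not use the authors' library, whose API is not described in the abstract. The profile URL, the helper names, and the set of required fields are hypothetical; only the element names (`resourceType`, `meta.profile`, `identifier`, `type`, `subject`, `collection.collectedDateTime`) come from the base FHIR Specimen resource.

```python
import json

# Hypothetical profile URL; the real MIABIS profile canonical is not given here.
HYPOTHETICAL_PROFILE = "https://example.org/fhir/StructureDefinition/miabis-sample"


def build_specimen(sample_id: str, donor_id: str, material: str, collected: str) -> dict:
    """Assemble a minimal FHIR Specimen resource as a dictionary."""
    return {
        "resourceType": "Specimen",
        "meta": {"profile": [HYPOTHETICAL_PROFILE]},
        "identifier": [{"value": sample_id}],
        "type": {"text": material},
        "subject": {"reference": f"Patient/{donor_id}"},
        "collection": {"collectedDateTime": collected},
    }


def missing_fields(resource: dict, required=("identifier", "type", "subject")) -> list[str]:
    """Toy check: list required top-level elements that are absent or empty."""
    return [name for name in required if not resource.get(name)]


if __name__ == "__main__":
    specimen = build_specimen("S-001", "D-042", "Serum", "2024-03-15")
    print(missing_fields(specimen))  # prints []
    print(json.dumps(specimen, indent=2))
```

Real validation would be done against the published profiles with a FHIR validator rather than a hand-written field list.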

Conclusion

This work delivers a machine-readable, interoperable implementation of MIABIS, enabling the exchange of both organisational and sample-level data across biobanks and health information systems. By integrating MIABIS with HL7 FHIR, we provide a host of reusable tools and mechanisms for further evolution of the data model. Combined, these benefits can help with the integration into clinical and research workflows, supporting data discoverability, reuse, and cross-institutional collaboration in biomedical research.
Journal of Biomedical Informatics, Volume 171, Article 104919
Citations: 0
A REDCap advanced randomization module to meet the needs of modern trials
IF 4.5 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-10-04 DOI: 10.1016/j.jbi.2025.104925
Luke Stevens , Nan Kennedy , Rob J. Taylor , Adam Lewis , Frank E. Harrell Jr , Matthew S. Shotwell , Emily S. Serdoz , Gordon R. Bernard , Wesley H. Self , Christopher J. Lindsell , Paul A. Harris , Jonathan D. Casey

Objective

Since 2012, the electronic data capture platform REDCap has included an embedded randomization module allowing a single randomization per study record with the ability to stratify by variables such as study site and participant sex at birth. In recent years, platform, adaptive, decentralized, and pragmatic trials have gained popularity. These trial designs often require approaches to randomization not supported by the original REDCap randomization module, including randomizing patients into multiple domains or at multiple points in time, changing allocation tables to add or drop study groups, or adaptively changing allocation ratios based on data from previously enrolled participants. Our team aimed to develop new randomization functions to address these issues.
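The baseline behaviour described here, one randomization per record with stratification, can be sketched as stratified permuted-block randomization driven by a pre-generated allocation table. This is an illustration of the general technique, not REDCap's code; the class and function names, the block composition, and the seed are assumptions.

```python
import random
from collections import defaultdict


def make_allocation_table(strata, block, blocks_per_stratum, seed):
    """Build, for each stratum, a sequence of independently shuffled balanced blocks."""
    rng = random.Random(seed)  # fixed seed so the table is reproducible
    table = {}
    for stratum in strata:
        sequence = []
        for _ in range(blocks_per_stratum):
            shuffled = list(block)
            rng.shuffle(shuffled)
            sequence.extend(shuffled)
        table[stratum] = sequence
    return table


class Randomizer:
    """Hands out the next unused allocation in a stratum, once per record."""

    def __init__(self, table):
        self._table = table
        self._next = defaultdict(int)  # next unused position per stratum
        self._assigned = {}            # record id -> allocated arm

    def randomize(self, record_id, stratum):
        if record_id in self._assigned:
            raise ValueError(f"record {record_id!r} is already randomized")
        position = self._next[stratum]
        arm = self._table[stratum][position]  # IndexError once the table is used up
        self._next[stratum] = position + 1
        self._assigned[record_id] = arm
        return arm


if __name__ == "__main__":
    strata = [("site-1", "F"), ("site-1", "M")]  # hypothetical site and sex strata
    table = make_allocation_table(strata, ["A", "A", "B", "B"], blocks_per_stratum=2, seed=7)
    randomizer = Randomizer(table)
    print(randomizer.randomize("rec-001", ("site-1", "F")))
```

The trial designs listed above go beyond this sketch: multiple randomizations per record would need a key other than the record id alone, and adaptive designs would need the table to be replaced or reweighted while enrollment is under way.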

Methods

A collaborative process facilitated by the NIH-funded Trial Innovation Network was initiated to modernize the randomization module in REDCap, incorporating feedback from clinical trialists, biostatisticians, technologists, and other experts.

Results

This effort led to the development of an advanced randomization module within the REDCap platform. In addition to supporting platform, adaptive, decentralized, and pragmatic trials, the new module introduces several new features, such as improved support for blinded randomization, additional randomization metadata capture (e.g., user identity and timestamp), additional tools allowing REDCap administrators to support investigators using the randomization module, and the ability for clinicians participating in pragmatic or decentralized trials to perform randomization through a survey without needing log-in access to the study database. As of June 19, 2025, multiple randomizations have been used in 211 projects from 55 institutions, randomizations with real-time trigger logic in 108 projects from 64 institutions, and blinded group allocation in 24 projects from 17 institutions.

Conclusion

The new randomization module aims to streamline the randomization process, improve trial efficiency, and ensure robust data integrity, thereby supporting the conduct of more sophisticated and adaptive clinical trials.
Journal of Biomedical Informatics, Volume 171, Article 104925
Citations: 0