Pub Date: 2025-12-01; Epub Date: 2025-11-18; DOI: 10.1016/j.jbi.2025.104960
Siying Yang, Ping-an He, Pan Zeng, Yajie Meng, Zilong Zhang, Feifei Cui, Yuhua Yao, Jialiang Yang, Junlin Xu
Drug-target interaction (DTI) prediction is of great significance for innovation and research in medicine. Traditional experimental methods for identifying DTIs are time-consuming and costly, so machine learning methods have been widely applied to DTI prediction. However, the sparsity of inter-node connections often leaves node representations insufficiently learned, and many methods ignore the topological similarity between nodes when integrating similarities. This study proposes MGCLDTI, a model that integrates multiple sources of information and uses graph contrastive learning (GCL) to predict potential drug-target interactions. First, MGCLDTI employs the DeepWalk algorithm to extract global topological representations from a heterogeneous graph that incorporates multi-view information on drugs, targets, and diseases. Next, a densification strategy is applied to alleviate the noise arising from the sparsity of the DTI matrix. A GCL model with node masking then enhances local structural awareness and optimizes the drug and target embeddings. Finally, DTI scores are predicted with the LightGBM algorithm. In comparisons with state-of-the-art methods, MGCLDTI achieves superior predictive performance. Ablation studies show that each component contributes, and case studies provide further evidence of MGCLDTI’s accuracy in identifying potential DTIs.
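The DeepWalk step mentioned in the abstract starts by sampling truncated random walks over the graph; the walks are then treated as "sentences" for a skip-gram embedding model. The following is a minimal sketch of the walk-sampling stage only, not the authors' implementation. The toy graph, node names, and parameter values are hypothetical.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample truncated uniform random walks (the sampling stage of DeepWalk)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy heterogeneous graph: drugs (d*), targets (t*), one disease (s1).
toy_graph = {
    "d1": ["t1", "s1"],
    "d2": ["t1", "t2"],
    "t1": ["d1", "d2"],
    "t2": ["d2", "s1"],
    "s1": ["d1", "t2"],
}

walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

In a full pipeline the walks would be passed to a skip-gram model to obtain node embeddings; that stage is omitted here.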
Title: Predicting drug-target interactions based on multivariate information fusion and graph contrast learning. Journal of Biomedical Informatics, vol. 172, Article 104960.
Pub Date: 2025-12-01; Epub Date: 2025-10-31; DOI: 10.1016/j.jbi.2025.104945
J.C. Wolber, M.E. Samadi, J. Sellin, A. Schuppert
Introduction:
Management of type 1 diabetes remains a significant challenge because blood glucose levels can fluctuate dramatically and vary widely between individuals. We introduce an approach that combines multimodal large language models (mLLMs), mechanistic modeling of individual glucose metabolism, and machine learning (ML) to forecast blood glucose levels.
Methods:
This study uses the D1NAMO dataset (6 patients with meal images) to demonstrate mLLM integration for glucose prediction. An mLLM (Pixtral Large) was employed to estimate macronutrients from meal images, providing automated meal analysis without manual food logging. We compare three distinct approaches: (1) Baseline using only glucose dynamics and basic insulin features, (2) LastMeal providing additional information about the last meal ingested by the patient, and (3) Bézier incorporating mechanistically modeled temporal features using optimized cubic Bézier curves to model temporal impacts of individual macronutrients on blood glucose. The modeled feature impacts served as input features for a LightGBM model. We also validate the mechanistic modeling component on the AZT1D dataset (24 patients with structured carbohydrate and correction insulin logs).
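To make the Bézier idea concrete, the sketch below evaluates a cubic Bézier curve whose x-coordinate is time since ingestion (hours) and whose y-coordinate is a relative impact on blood glucose, then locates the peak by sampling. The control points are hypothetical and are not the optimized values from the study; the exact parametrization used by the authors is not given in the abstract.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    Each control point is a (time_hours, impact) pair.
    """
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return x, y


# Hypothetical control points for one macronutrient (illustration only).
P0, P1, P2, P3 = (0.0, 0.0), (0.5, 1.0), (1.5, 1.0), (3.0, 0.0)

# Sample the curve and take the point with the largest impact as the peak.
samples = [cubic_bezier(P0, P1, P2, P3, i / 100) for i in range(101)]
peak_time, peak_impact = max(samples, key=lambda point: point[1])
```

In a forecasting setup of the kind described, the curve value at the elapsed time since each meal could be used as an input feature, with the control points fitted per macronutrient and per patient.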
Results:
The Bézier approach achieved the best performance across both datasets: D1NAMO RMSE of 15.06 at 30 min and 28.15 at 60 min; AZT1D RMSE of 16.61 at 30 min and 24.58 at 60 min. One-way ANOVA revealed statistically significant differences across prediction horizons of 45 to 120 min for the AZT1D dataset. Patient-specific Bézier curves revealed distinct metabolic response patterns: simple sugars peaked at 0.74 h, complex sugars at 3.07 h, and proteins at 4.36 h post-ingestion. Feature importance analysis showed temporal evolution from glucose change dominance to macronutrient prominence at longer horizons. Patient-specific modeling uncovered individual metabolic signatures with varying nutritional sensitivity and circadian influences.
Conclusion:
This study demonstrates the potential of combining mLLMs with mechanistic modeling for personalized diabetes management. The optimized Bézier curve approach provides superior temporal mapping while patient-specific models reveal individual metabolic signatures essential for personalized care.
Title: Multimodal large language models and mechanistic modeling for glucose forecasting in type 1 diabetes patients. Journal of Biomedical Informatics, vol. 172, Article 104945.
Pub Date: 2025-12-01; Epub Date: 2025-11-25; DOI: 10.1016/j.jbi.2025.104964
Yong Li, Jianping Man, Yi Zhou, Likeng Liang
Objective
Medical Visual Question Answering (VQA) is a quintessential application scenario for biomedical Multimodal Large Language Models (MLLMs). Previous studies focused mainly on the input image-question pairs and neglected the rich medical knowledge in the captions of the pretraining datasets. This limits the model’s reasoning capability and causes overfitting. This paper aims to use the captions of the pretraining datasets effectively to address these issues.
Methods
This paper proposes a Caption-Augmented Reasoning Model (CARM), which introduces three components that leverage the captions during finetuning: (1) a Cross-Modal Visual Augmentation (CMVA) module that enriches image feature representations through semantic alignment with retrieved captions; (2) a Retrieval Cross-Modal Attention (RCMA) mechanism that establishes explicit connections between visual features and domain-specific medical knowledge; and (3) a Hierarchical Rank Low-Rank Adaptation (HR-LoRA) module that makes finetuning parameter-efficient through rank-adaptive decomposition in both the unimodal encoders and the multimodal fusion layers.
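For orientation, the standard LoRA update that HR-LoRA builds on can be written as follows. This is the generic formulation, not the paper's hierarchical variant; the symbols d, k, r, and α are the usual LoRA notation and do not appear in the abstract.

```latex
\[
h = W_0 x + \Delta W\,x, \qquad
\Delta W = \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
\]
```

Only A and B are trained while W_0 stays frozen, so a layer contributes r(d + k) trainable parameters instead of dk. As a hypothetical example, d = k = 768 and r = 8 give 12,288 trainable parameters instead of 589,824. A rank-adaptive scheme would presumably choose a different r for each layer or module, though the abstract does not state the selection rule.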
Results
The proposed CARM achieved state-of-the-art performance across three benchmark datasets, with accuracy scores of 0.798 on VQA-RAD, 0.867 on VQA-SLAKE, and 0.718 on VQA-Med-2019, respectively, outperforming existing medical VQA models. Qualitative evaluations revealed that our caption-based augmentation effectively directed model attention to the image regions related to a question.
Conclusions
The proposed CARM effectively improves visual grounding and reasoning accuracy with the systematic integration of medical captions, and the HR-LoRA alleviates overfitting and improves training efficiency.
Title: Caption-augmented reasoning model with Hierarchical rank LoRA finetuing for medical visual question Answering. Journal of Biomedical Informatics, vol. 172, Article 104964.
Pub Date: 2025-12-01; Epub Date: 2025-11-15; DOI: 10.1016/j.jbi.2025.104955
Alireza Moayedikia, Sara Fin, Uffe Kock Wiil
Objective:
Clinical trial recruitment faces critical challenges with screen failure rates exceeding 80% in Alzheimer’s disease (AD) trials. Traditional patient selection relies on expert consensus without systematic evaluation of trade-offs between statistical power, recruitment feasibility, safety, and cost. We developed a multi-objective optimization framework to systematically identify optimal eligibility criteria configurations that balance competing objectives in AD clinical trial design.
Methods:
We implemented the Non-dominated Sorting Genetic Algorithm III (NSGA-III) to optimize patient selection criteria across three objectives: patient identification accuracy (F1 score), recruitment balance, and economic efficiency. The framework utilized National Alzheimer’s Coordinating Center data comprising 2,743 participants with comprehensive clinical assessments and cerebrospinal fluid biomarker measurements. We optimized 14 eligibility parameters including age boundaries, cognitive thresholds, biomarker criteria, and comorbidity management policies. Statistical validation employed Monte Carlo simulation with 10,000 iterations, bootstrap analysis, and SHAP interpretability analysis.
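The core notion behind NSGA-III and the Pareto-optimal solutions it returns is non-domination. The sketch below filters a set of candidate criteria configurations down to the non-dominated ones, assuming all three objectives are to be maximized. It shows only the dominance test; NSGA-III additionally uses reference-point-based selection, which is omitted. The candidate scores are hypothetical and are not taken from the study.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (F1 score, recruitment balance, economic efficiency) triples.
candidates = [
    (0.99, 0.40, 0.55),
    (0.95, 0.80, 0.60),
    (0.97, 0.60, 0.70),
    (0.96, 0.50, 0.50),  # dominated by (0.97, 0.60, 0.70)
]

front = pareto_front(candidates)
```

Each point on the front represents a different trade-off, which is why the optimization yields a set of solutions rather than a single best configuration.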
Results:
Optimization identified 11 Pareto-optimal solutions spanning F1 scores from 0.979 to 0.995 and eligible patient pools from 108 to 327. Compared to standard criteria selecting 101 participants, optimized approaches identified 102 participants with no significant demographic or clinical differences after multiple comparison correction. Monte Carlo simulation revealed mean cost savings of $1,048 per patient (95% CI: -$1,251 to $3,492), with 80.7% probability of positive savings but 19.3% risk of cost increases (SD = $1,208). Cross-validation demonstrated high precision (95.1%) with strategic selectivity (9.4% recall). SHAP analysis identified biomarker requirements as the dominant cost driver. Optimization algorithms converged toward solutions similar to expert-designed criteria, validating both computational and clinical approaches.
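As a rough consistency check on the reported savings figures, one can ask what probability of positive savings a normal distribution with the reported mean and SD would imply. The normality assumption is ours; the abstract does not state the shape of the simulated distribution.

```python
from statistics import NormalDist

# Reported per-patient figures from the abstract.
mean_savings = 1048.0
sd_savings = 1208.0

# Assumption (not from the source): savings are approximately normal.
savings = NormalDist(mu=mean_savings, sigma=sd_savings)

p_cost_increase = savings.cdf(0.0)  # probability that savings fall below zero
p_positive = 1.0 - p_cost_increase  # probability of positive savings
```

Under this assumption the probability of positive savings comes out close to the 80.7% reported from the Monte Carlo simulation.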
Conclusion:
Multi-objective optimization provides meaningful but incremental value through systematic validation and probabilistic efficiency enhancement rather than revolutionary transformation. The convergence toward established practice demonstrates that computational approaches serve as sophisticated validation tools that identify concrete yet uncertain efficiency improvements within existing frameworks. The substantial variability in projected outcomes establishes realistic expectations and highlights the importance of site-specific evaluation, particularly regarding recruitment infrastructure quality as the dominant determinant of success. This establishes a mature paradigm for evidence-based trial design optimization that enhances rather than replaces clinical expertise.
Title: Multi-objective optimization formulation for Alzheimer’s disease trial patient selection. Journal of Biomedical Informatics, vol. 172, Article 104955.
Pub Date: 2025-11-01; Epub Date: 2025-10-16; DOI: 10.1016/j.jbi.2025.104933
Jianbin Tan, Yan Zhang, Chuan Hong, T. Tony Cai, Tianxi Cai, Anru R. Zhang
Objectives:
We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.
Materials and Methods:
We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
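The two missingness mechanisms can be illustrated with a toy patient-by-feature matrix. In this sketch, "structured" missingness means that a whole block of features is absent for a group of rows (for example, a hospital that never records them), and "sporadic" missingness means that individual entries are missing at random. This reading of the terms, the function name, and all values are assumptions for illustration; the sketch does not implement Macomss.

```python
import random


def make_missing(matrix, structured_rows, structured_cols, sporadic_rate, seed=0):
    """Return a copy of matrix with a structured missing block plus
    sporadic (random) missing entries, both marked as None."""
    rng = random.Random(seed)
    out = [row[:] for row in matrix]
    # Structured: every listed column is missing for every listed row.
    for i in structured_rows:
        for j in structured_cols:
            out[i][j] = None
    # Sporadic: each remaining entry is dropped independently at random.
    for row in out:
        for j in range(len(row)):
            if row[j] is not None and rng.random() < sporadic_rate:
                row[j] = None
    return out


# Hypothetical data: 6 patients x 4 features; patients 3-5 come from a
# site that does not record features 2 and 3.
full = [[float(10 * i + j) for j in range(4)] for i in range(6)]
observed = make_missing(full, structured_rows=[3, 4, 5],
                        structured_cols=[2, 3], sporadic_rate=0.1)
```

An imputation method for integrated EHR analysis has to handle both patterns at once, since the structured block cannot be treated as missing at random.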
Results:
Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.
Discussion:
The proposed method effectively addresses critical missingness patterns that arise in the integrated analysis of EHR datasets, enhancing the robustness and generalizability of clinical predictions.
Conclusions:
We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Title: Integrated analysis for electronic health records with structured and sporadic missingness. Journal of Biomedical Informatics, vol. 171, Article 104933.
Pub Date: 2025-11-01; Epub Date: 2025-10-17; DOI: 10.1016/j.jbi.2025.104940
Min Tang, Yuhao Zhang, Ronghua Liang, Guoqiang Deng
Objective:
In medical environments, patient records are stored as heterogeneous features across different institutions, and legal or institutional constraints prohibit sharing the raw data. This fragmentation poses challenges for Online Medical Pre-Diagnosis (OMPD) systems. Existing methods such as federated learning require multiple rounds of interaction among all participating parties (hospitals and cloud servers), resulting in frequent communication. Moreover, because they share global gradients, they are vulnerable to inference attacks that leak information. In this paper, we propose a secure and efficient OMPD system framework that addresses vertical data fragmentation and aims to reconcile medical data isolation with collaborative model training.
Methods:
We propose PPNLR, a secure framework for building OMPD systems. The framework combines functional encryption and blinding factors to design a sample-feature dimension encryption algorithm and a privacy-preserving vectorization training algorithm. Decoupling sample computation from model training enables cross-client data aggregation with only a single round of communication between hospitals and cloud servers.
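The role a blinding factor can play in aggregation is shown in the generic sketch below: each party adds a random mask to its private value, and the masks are constructed to cancel when the blinded values are summed. This is a textbook additive-masking illustration under our own assumptions, not the PPNLR construction, and it omits functional encryption entirely. The modulus, function names, and values are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # a Mersenne prime, chosen here only for illustration


def zero_sum_blinding_factors(n_parties):
    """Draw random factors that sum to zero modulo MODULUS."""
    factors = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    factors.append((-sum(factors)) % MODULUS)  # last factor cancels the rest
    return factors


def blind(value, factor):
    """Hide a private value behind its blinding factor."""
    return (value + factor) % MODULUS


# Hypothetical private contributions from three hospitals.
private_values = [12, 7, 30]
factors = zero_sum_blinding_factors(len(private_values))
blinded = [blind(v, f) for v, f in zip(private_values, factors)]

# The aggregator sees only blinded values, yet the masks cancel in the sum.
aggregate = sum(blinded) % MODULUS
```

In a real protocol the factors would be derived so that no single party knows all of them, for example through pairwise key agreement or a trusted setup. This sketch generates them in one place purely for brevity.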
Results:
Security analysis shows that PPNLR is resistant to semi-honest inference attacks and collusion attacks. Evaluation on six real-world medical datasets (text and images) shows that: (i) inference accuracy is close to that of the centralized plaintext training benchmark; (ii) computational efficiency is at least 3.6× higher than that of comparable approaches; and (iii) communication complexity is significantly reduced because it no longer depends on the iteration count.
Conclusion:
PPNLR achieves data protection through cryptographic primitives, maintaining high diagnostic accuracy while securing both the medical data and the model parameters. Its single-communication architecture significantly lowers the deployment threshold in resource-constrained scenarios, providing a practical framework for building privacy-friendly OMPD systems.
Title: A non-interactive Online Medical Pre-Diagnosis system on encrypted vertically partitioned data. Journal of Biomedical Informatics, vol. 171, Article 104940.
Pub Date : 2025-11-01 Epub Date: 2025-09-13 DOI: 10.1016/j.jbi.2025.104903
Hadasa Kaufman, Nadav Rappoport, Amir Gilad, Michal Linial
Causal inference from observational medical record data is critical for advancing precision and personalization in healthcare. Recently, biobanks – collections of biological samples linked with genetic, lifestyle, environmental, and health-related data – have emerged as valuable resources for large-scale population studies. By integrating these resources, biobanks offer a harmonized repository of diverse data for each individual, capturing real-world medical events, including procedures, treatments, and diagnoses. However, these resources are often affected by confounding factors, selection biases, and missing information, posing significant challenges to drawing valid causal conclusions. While randomized controlled trials (RCTs) remain the gold standard for drug development and medical decision-making, the growing availability of observational data highlights the need for robust causal inference methodologies. This study provides an overview of methods for inferring the effect of a treatment on an outcome from observational data applicable to biobank data, focusing on the unique challenges they address. Our objective is to introduce current methods used for causal discovery in observational medical data. We discuss classic and modern methodologies, the opportunities they offer, and the difficulty of establishing causality. We cover statistical methods designed for large-scale biobanks that have the potential to improve clinical decision-making, guide public health policies, and drive further research.
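To make the confounding problem concrete, the following is a minimal sketch, on invented counts, of how a naive treated-versus-untreated comparison can disagree with an estimate that is stratified on a measured confounder. The counts, the single binary confounder `severe`, and both function names are assumptions made for illustration; none of them comes from the review or from any biobank.

```python
from collections import defaultdict

# Hypothetical records: (severe, treated, n_patients, n_recovered).
# Sicker patients are treated more often, which confounds a naive comparison.
COUNTS = [
    (0, 0, 800, 640),  # mild, untreated: 80% recover
    (0, 1, 200, 180),  # mild, treated: 90% recover
    (1, 0, 200, 60),   # severe, untreated: 30% recover
    (1, 1, 800, 320),  # severe, treated: 40% recover
]


def naive_difference(counts):
    """Recovery rate of treated minus untreated, ignoring the confounder."""
    n = defaultdict(int)
    r = defaultdict(int)
    for _, treated, total, recovered in counts:
        n[treated] += total
        r[treated] += recovered
    return r[1] / n[1] - r[0] / n[0]


def standardized_difference(counts):
    """Average the within-stratum differences, weighted by stratum size."""
    stratum_n = defaultdict(int)
    rate = {}
    for severe, treated, total, recovered in counts:
        stratum_n[severe] += total
        rate[(severe, treated)] = recovered / total
    population = sum(stratum_n.values())
    return sum(
        (stratum_n[s] / population) * (rate[(s, 1)] - rate[(s, 0)])
        for s in stratum_n
    )


print(f"naive:        {naive_difference(COUNTS):+.2f}")
print(f"standardized: {standardized_difference(COUNTS):+.2f}")
```

On these made-up numbers the naive difference is negative while the stratified one is positive, because treatment is concentrated among severe cases. Stratification of this kind only removes bias from confounders that are actually measured, which is one reason the missing-information problem noted above matters.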
{"title":"Advancing causal inference in medicine using biobank data","authors":"Hadasa Kaufman, Nadav Rappoport, Amir Gilad, Michal Linial","doi":"10.1016/j.jbi.2025.104903","DOIUrl":"10.1016/j.jbi.2025.104903","url":null,"PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104903"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145069728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-01 Epub Date: 2025-10-01 DOI: 10.1016/j.jbi.2025.104920
Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün
Objective:
Record linkage is essential for integrating data from multiple sources with diverse applications in real-world healthcare and research. Probabilistic Privacy-Preserving Record Linkage (PPRL) enables this integration while protecting sensitive information from unauthorized access, especially when datasets lack exact identifiers. As privacy regulations evolve and multi-institutional collaborations expand globally, there is a growing demand for methods that effectively balance security, accuracy, and efficiency. However, ensuring both privacy and scalability in large-scale record linkage remains a key challenge.
Method:
This paper presents a novel and efficient PPRL method based on a secure three-party computation framework, a form of secure multi-party computation (MPC). Our approach allows multiple parties to compute linkage results without exposing their private inputs and significantly speeds up the linkage process compared with existing PPRL solutions.
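The abstract does not say which MPC protocol is used, so the following is only a generic illustration of the idea of computing on private inputs: three-party additive secret sharing, where a value is split into random shares and sums are computed share by share. The modulus, the function names, and the choice of additive sharing itself are assumptions for this sketch, not the authors' construction.

```python
import secrets

MODULUS = 2**64  # shares live in the integers modulo 2^64 (an assumed choice)


def share(value, parties=3):
    """Split value into `parties` random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine all shares; any proper subset alone looks uniformly random."""
    return sum(shares) % MODULUS


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally, with no communication."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


a = share(1234)
b = share(4321)
print(reconstruct(add_shared(a, b)))  # the sum of the two hidden inputs
```

Addition is free in this scheme; the costly part of real protocols is multiplication and comparison, which need interaction between the parties and are where network bandwidth and latency come into play.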
Result:
Our method preserves the linkage quality of a state-of-the-art (SOTA) MPC-based PPRL method while running up to 14 times faster. For example, linking a record against a database of 10,000 records takes just 8.74 s in a realistic network with 700 Mbps bandwidth and 60 ms latency, compared to 92.32 s with the SOTA method. Even on a slower internet connection with 100 Mbps bandwidth and 60 ms latency, the linkage completes in 28 s, whereas the SOTA method requires 287.96 s. These results demonstrate the significant scalability and efficiency improvements of our approach.
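As a quick check on the two quoted timings, the speedup in each network setting can be worked out directly. The dictionary layout and the function name are just a convenient arrangement assumed for this sketch; the numbers are the ones stated above.

```python
# Timings in seconds, as quoted in the abstract.
TIMINGS = {
    "700 Mbps, 60 ms": {"proposed": 8.74, "sota": 92.32},
    "100 Mbps, 60 ms": {"proposed": 28.0, "sota": 287.96},
}


def speedup(sota_seconds, proposed_seconds):
    """How many times faster the proposed method is than the baseline."""
    return sota_seconds / proposed_seconds


for network, t in TIMINGS.items():
    print(f"{network}: {speedup(t['sota'], t['proposed']):.1f}x faster")
```

Both itemized settings come out at a little over 10x, so the "up to 14 times" figure presumably refers to a configuration that the abstract does not list.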
Conclusion:
Our novel PPRL method, based on secure 3-party computation, offers an efficient and scalable solution for large-scale record linkage while ensuring privacy protection. The approach demonstrates significant performance improvements, making it a promising tool for secure data integration in privacy-sensitive sectors.
{"title":"Accelerating probabilistic privacy-preserving medical record linkage: A three-party MPC approach","authors":"Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün","doi":"10.1016/j.jbi.2025.104920","DOIUrl":"10.1016/j.jbi.2025.104920","url":null,"PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104920"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145223419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-01 DOI: 10.1016/j.jbi.2025.104919
Radovan Tomášik, Šimon Koňár, Niina Eklund, Cäcilia Engels, Zdenka Dudova, Radoslava Kacová, Roman Hrstka, Petr Holub
Objective
Biobanks and biomolecular resources are increasingly central to data-driven biomedical research, encompassing not only metadata but also granular, sample-related data from diverse sources such as healthcare systems, national registries, and research outputs. However, the lack of a standardised, machine-readable format for representing such data limits interoperability, data reuse and integration into clinical and research environments. While MIABIS provides a conceptual model for biobank data, its abstract nature and reliance on heterogeneous implementations create barriers to practical, scalable adoption. This study presents a pragmatic, operational implementation of MIABIS focused on enabling real-world exchange and integration of sample-level data.
Methods
We systematically evaluated established data exchange standards, comparing HL7 FHIR and OMOP CDM with respect to their suitability for structuring sample-related data in a semantically robust and machine-readable form. Based on this analysis, we developed a FHIR-based representation of MIABIS that supports complex biobank structures and enables integration with federated data infrastructures. Supporting tools, including a Python library and an implementation guide, were created to ensure usability across diverse research and clinical contexts.
Results
We created nine interoperable FHIR profiles covering core MIABIS entities, ensuring consistency with FHIR standards. To support adoption, we developed an open-source Python library that abstracts FHIR interactions and provides schema validation for MIABIS-compliant data. The library was integrated into an ETL tool in operation at the Czech Node of BBMRI-ERIC (the European Biobanking and Biomolecular Resources Research Infrastructure) to demonstrate usability with real-world sample-related data. Separately, we validated the representation of MIABIS entities at the organisational level by converting the data structures of the BBMRI-ERIC Directory into FHIR, demonstrating compatibility with federated data infrastructures.
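To give a feel for what sample-level data looks like in FHIR, here is a minimal sketch of a plain FHIR R4 `Specimen` resource built as a Python dictionary, with a toy required-field check. It does not use the authors' library or their MIABIS profiles, whose API and constraints the abstract does not describe. The identifier system URL, the sample values, the `REQUIRED` tuple, and both helper names are hypothetical.

```python
import json


def make_specimen(sample_id, donor_id, material_text, collected_on):
    """Build a minimal FHIR R4 Specimen resource as a plain dict."""
    return {
        "resourceType": "Specimen",
        "identifier": [
            {"system": "https://example.org/biobank/sample-id", "value": sample_id}
        ],
        "type": {"text": material_text},
        "subject": {"reference": f"Patient/{donor_id}"},
        "collection": {"collectedDateTime": collected_on},
    }


# Assumed local rule for this sketch, not a constraint taken from a MIABIS profile.
REQUIRED = ("resourceType", "identifier", "type", "subject")


def missing_fields(resource):
    """Return the required keys that are absent or empty in the resource."""
    return [key for key in REQUIRED if not resource.get(key)]


specimen = make_specimen("S-0001", "D-42", "Serum", "2024-05-17")
print(json.dumps(specimen, indent=2))
print("missing:", missing_fields(specimen))
```

A real profile would go further than a key check, for example by binding `type` to a coded value set and fixing cardinalities, which is the kind of rule a profile-aware validator enforces.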
Conclusion
This work delivers a machine-readable, interoperable implementation of MIABIS, enabling the exchange of both organisational and sample-level data across biobanks and health information systems. By integrating MIABIS with HL7 FHIR, we provide reusable tools and mechanisms for the further evolution of the data model. Together, these can ease integration into clinical and research workflows, supporting data discoverability, reuse, and cross-institutional collaboration in biomedical research.
{"title":"Definitions to data flow: Operationalizing MIABIS in HL7 FHIR","authors":"Radovan Tomášik, Šimon Koňár, Niina Eklund, Cäcilia Engels, Zdenka Dudova, Radoslava Kacová, Roman Hrstka, Petr Holub","doi":"10.1016/j.jbi.2025.104919","DOIUrl":"10.1016/j.jbi.2025.104919","url":null,"PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104919"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145191636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-01 Epub Date: 2025-10-04 DOI: 10.1016/j.jbi.2025.104925
Luke Stevens, Nan Kennedy, Rob J. Taylor, Adam Lewis, Frank E. Harrell Jr, Matthew S. Shotwell, Emily S. Serdoz, Gordon R. Bernard, Wesley H. Self, Christopher J. Lindsell, Paul A. Harris, Jonathan D. Casey
Objective
Since 2012, the electronic data capture platform REDCap has included an embedded randomization module allowing a single randomization per study record with the ability to stratify by variables such as study site and participant sex at birth. In recent years, platform, adaptive, decentralized, and pragmatic trials have gained popularity. These trial designs often require approaches to randomization not supported by the original REDCap randomization module, including randomizing patients into multiple domains or at multiple points in time, changing allocation tables to add or drop study groups, or adaptively changing allocation ratios based on data from previously enrolled participants. Our team aimed to develop new randomization functions to address these issues.
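Stratified randomization of the kind described here is commonly driven by a pre-generated allocation table. The sketch below builds one with permuted blocks, one sequence per (site, sex) stratum. It is not REDCap's implementation; the arm names, block size, number of blocks, seed, and the function name are all assumptions chosen for illustration.

```python
import random
from itertools import product


def allocation_table(sites, sexes, arms=("control", "intervention"),
                     blocks_per_stratum=2, block_size=4, seed=2025):
    """Return {(site, sex): [arm, ...]} using permuted blocks within each stratum."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the table is reproducible
    table = {}
    for stratum in product(sites, sexes):
        sequence = []
        for _ in range(blocks_per_stratum):
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)  # arms are balanced at the end of every block
            sequence.extend(block)
        table[stratum] = sequence
    return table


table = allocation_table(["site_a", "site_b"], ["female", "male"])
for stratum, sequence in table.items():
    print(stratum, sequence)
```

Each new participant takes the next unused entry in their stratum's sequence. Adding or dropping a study group, or changing the allocation ratio, means generating a new table part-way through enrollment, which is the kind of change the original single-table design did not accommodate.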
Methods
A collaborative process facilitated by the NIH-funded Trial Innovation Network was initiated to modernize the randomization module in REDCap, incorporating feedback from clinical trialists, biostatisticians, technologists, and other experts.
Results
This effort led to the development of an advanced randomization module within the REDCap platform. In addition to supporting platform, adaptive, decentralized, and pragmatic trials, the new module introduces several new features, such as improved support for blinded randomization, additional randomization metadata capture (e.g., user identity and timestamp), additional tools allowing REDCap administrators to support investigators using the randomization module, and the ability for clinicians participating in pragmatic or decentralized trials to perform randomization through a survey without needing log-in access to the study database. As of June 19, 2025, multiple randomizations have been used in 211 projects from 55 institutions, randomizations with real-time trigger logic in 108 projects from 64 institutions, and blinded group allocation in 24 projects from 17 institutions.
Conclusion
The new randomization module aims to streamline the randomization process, improve trial efficiency, and ensure robust data integrity, thereby supporting the conduct of more sophisticated and adaptive clinical trials.
{"title":"A REDCap advanced randomization module to meet the needs of modern trials","authors":"Luke Stevens, Nan Kennedy, Rob J. Taylor, Adam Lewis, Frank E. Harrell Jr, Matthew S. Shotwell, Emily S. Serdoz, Gordon R. Bernard, Wesley H. Self, Christopher J. Lindsell, Paul A. Harris, Jonathan D. Casey","doi":"10.1016/j.jbi.2025.104925","DOIUrl":"10.1016/j.jbi.2025.104925","url":null,"PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"171 ","pages":"Article 104925"},"PeriodicalIF":4.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145238683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}