Pub Date: 2025-12-01, Epub Date: 2025-10-27, DOI: 10.1016/j.jbi.2025.104948
Yuxin Lin, Jing Ma, Suyu Dong, Chaoyu Sun, Wanting Cong, Kuanquan Wang, Gongning Luo, Wei Wang
Objective:
Existing generative models for electrocardiogram (ECG) synthesis often lack fine-grained, interpretable control, limiting their utility for addressing data scarcity and imbalance. This study aims to develop a model capable of producing diverse and semantically controllable synthetic ECGs to fill this critical gap.
Methods:
We propose TransDiffECG, a novel Transformer-based diffusion model that integrates semantic information injection and global temporal modeling to enable fine-grained control over ECG synthesis. The model allows user-controllable generation of ECG signals with customized physiological details. We establish a comprehensive evaluation protocol, including downstream segmentation and classification tasks, to rigorously assess the authenticity and utility of the generated signals. Extensive experiments are conducted on both single-lead (QTDB) and multi-lead (LUDB) ECG datasets.
Results:
TransDiffECG significantly outperforms state-of-the-art baselines. On the multi-lead LUDB dataset, it achieved superior signal quality (MMD: 3.21 × 10⁻²; Pearson Correlation: 0.6177). The utility of the synthetic data was confirmed in downstream tasks, where data augmentation improved atrial fibrillation classification to an AUROC of 0.9451. Moreover, a segmentation model trained solely on our synthetic data rivaled one trained on real data (e.g., ~98% precision/recall on QTDB).
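The two signal-quality metrics named above can be illustrated with a minimal sketch. The abstract does not state which kernel or bandwidth the authors used for MMD, so the RBF kernel, the `gamma` value, and the toy sine-wave "signals" below are all assumptions made for illustration only.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between two sample sets (rows = signals).

    Assumption: an RBF kernel exp(-gamma * ||a - b||^2); the paper's kernel is not given.
    """
    def kernel(a, b):
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def pearson(a, b):
    """Pearson correlation between two 1-D signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data (hypothetical): noisy sine waves standing in for ECG segments.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
real = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal((32, 200))
synthetic = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal((32, 200))
shifted = real + 1.0  # a deliberately mismatched distribution

mmd_close = rbf_mmd2(real, synthetic, gamma=1.0 / 200)
mmd_far = rbf_mmd2(real, shifted, gamma=1.0 / 200)
corr = pearson(real[0], synthetic[0])
```

A lower MMD indicates that the synthetic and real sample distributions are closer, while a Pearson correlation nearer to 1 indicates closer waveform agreement between paired signals.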
Conclusion:
TransDiffECG represents a significant advancement in synthetic medical signal generation by bridging the gap between clinical interpretability and generative flexibility. Its ability to generate semantically controllable and clinically valid ECGs greatly expands the application potential of generative models in healthcare research and practice.
Title: TransDiffECG: Semantically controllable ECG synthesis via transformer-based diffusion modeling. Journal of Biomedical Informatics, volume 172, Article 104948.
Pub Date: 2025-12-01, Epub Date: 2025-11-06, DOI: 10.1016/j.jbi.2025.104951
Shuyang Xie, Hailing Cai, Yaoqin Sun, Xudong Lv
Objective
To develop and evaluate LLM-DQR, an automated approach using large language models to generate electronic health record data quality rules, addressing the limitations of current manual and automated methods that suffer from low efficiency, limited flexibility, and inadequate coverage of complex business logic.
Materials and Methods
We designed a comprehensive pipeline with three core components: (1) standardized input processing integrating database schemas, natural language requirements, and sample data; (2) Chain-of-Thought prompt engineering for guided rule generation; and (3) closed-loop validation with deduplication, sandbox execution, and iterative debugging. The approach was evaluated on two distinct, publicly available datasets: the Paediatric Intensive Care (PIC) dataset and the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. Performance was compared against manual expert construction (expert-DQR) and clinical information model-based generation (CIM-DQR).
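The closed-loop validation step (deduplication followed by sandbox execution) can be sketched roughly as follows. This is not the authors' implementation: the `admissions` table, its columns, the example rules, and the use of an in-memory SQLite database as the sandbox are all hypothetical choices for illustration.

```python
import sqlite3

def normalize(sql):
    """Collapse case, whitespace, and a trailing semicolon so near-identical rules compare equal."""
    return " ".join(sql.lower().split()).rstrip(";")

def deduplicate(rules):
    """Keep the first occurrence of each rule after normalization."""
    seen, unique = set(), []
    for rule in rules:
        key = normalize(rule)
        if key not in seen:
            seen.add(key)
            unique.append(rule)
    return unique

def run_in_sandbox(rule_sql, sample_rows):
    """Execute a candidate rule against sample data in a throwaway in-memory database.

    Returns (True, violating_rows) on success or (False, error_message) on failure;
    the error message is what an iterative-debugging loop could feed back to the LLM.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE admissions (patient_id INTEGER, admit_time TEXT, discharge_time TEXT)"
        )
        conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)", sample_rows)
        return True, conn.execute(rule_sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

# Hypothetical sample data and generated rules.
rows = [
    (1, "2020-01-01 10:00", "2020-01-03 09:00"),
    (2, "2020-02-05 08:00", "2020-02-04 12:00"),  # discharge before admission
]
generated = [
    "SELECT patient_id FROM admissions WHERE discharge_time < admit_time",
    "select patient_id  from admissions where discharge_time < admit_time;",
    "SELECT patient_id FROM admisions",  # misspelled table name, should fail
]
unique_rules = deduplicate(generated)
ok, violations = run_in_sandbox(unique_rules[0], rows)
bad_ok, bad_message = run_in_sandbox(unique_rules[1], rows)
```

Running each rule in an isolated database means a malformed rule only produces an error message rather than affecting real data.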
Results
LLM-DQR demonstrated higher performance across all evaluation metrics. The GPT implementation achieved overall coverage rates of 97.1% on the PIC dataset and 99.6% on the MIMIC-IV dataset, outperforming CIM-DQR. Performance was particularly strong for complex dimensions: LLM-DQR achieved 100% coverage for Consistency rules on both datasets, whereas CIM-DQR achieved 0%. Construction time was reduced by over 10-fold compared to manual methods. Additionally, on the PIC dataset, LLM-DQR generated 89 extra, expert-validated rules.
Discussion
The stronger performance demonstrates LLMs’ capability to understand complex EHR data patterns and assessment requirements, functioning as data quality analysis assistants with domain knowledge and logical reasoning capabilities.
Conclusion
LLM-DQR provides an efficient, scalable solution for automated data quality rule generation in clinical settings, offering considerable improvements over traditional approaches.
Title: LLM-DQR: Large language model-based automated generation of data quality rules for electronic health records. Journal of Biomedical Informatics, volume 172, Article 104951.
Pub Date: 2025-12-01, Epub Date: 2025-11-18, DOI: 10.1016/j.jbi.2025.104960
Siying Yang, Ping-an He, Pan Zeng, Yajie Meng, Zilong Zhang, Feifei Cui, Yuhua Yao, Jialiang Yang, Junlin Xu
Drug-target interaction (DTI) prediction is of great significance in stimulating innovation and research in the medical field. Traditional experimental methods for identifying DTIs have proven to be time-consuming and costly. As a result, machine learning methods have been extensively applied to improve the prediction of drug-target interactions. However, the sparsity of inter-node connections often results in insufficiently learned node representations. Furthermore, many methods do not take into account the topological similarity between nodes when integrating similarities. This study proposes a model, MGCLDTI, that integrates multiple sources of information and utilizes Graph Contrastive Learning (GCL) to predict potential drug-target interactions. First, MGCLDTI employs the DeepWalk algorithm to extract global topological representations from a heterogeneous graph that incorporates multi-view information on drugs, targets, and diseases. Subsequently, a densification strategy is implemented to alleviate the noise impact arising from the sparsity of the DTI matrix. Furthermore, a GCL model with node masking is applied to enhance local structural awareness and optimize the embeddings of drugs and targets. Finally, DTI scores are predicted using the LightGBM algorithm. Comparative results against state-of-the-art methods demonstrate that MGCLDTI achieves superior predictive performance. In addition, ablation studies reveal the effectiveness of each component. Case studies also provide compelling evidence of MGCLDTI’s accuracy in identifying potential DTIs.
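The first stage of DeepWalk, generating truncated uniform random walks over the graph, can be sketched as below. The node names and the tiny heterogeneous graph are hypothetical, and the sketch stops before the skip-gram training that DeepWalk applies to the walks to obtain embeddings.

```python
import random

def random_walks(adjacency, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated uniform random walks, the corpus DeepWalk feeds to skip-gram."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical undirected graph mixing drug, target, and disease nodes.
graph = {
    "drug_A": ["target_X", "disease_1"],
    "drug_B": ["target_X"],
    "target_X": ["drug_A", "drug_B", "disease_1"],
    "disease_1": ["drug_A", "target_X"],
}
walks = random_walks(graph, walks_per_node=2, walk_length=5, seed=0)
```

Nodes that frequently co-occur within these walks end up with similar embeddings once a skip-gram model is trained on them, which is how global topology is captured.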
Title: Predicting drug-target interactions based on multivariate information fusion and graph contrast learning. Journal of Biomedical Informatics, volume 172, Article 104960.
Pub Date: 2025-12-01, Epub Date: 2025-10-31, DOI: 10.1016/j.jbi.2025.104945
J.C. Wolber, M. E. Samadi, J. Sellin, A. Schuppert
Introduction:
Management of type 1 diabetes remains a significant challenge because blood glucose levels can fluctuate dramatically and are highly individual. We introduce an innovative approach that combines multimodal large language models (mLLMs), mechanistic modeling of individual glucose metabolism, and machine learning (ML) for forecasting blood glucose levels.
Methods:
This study uses the D1NAMO dataset (6 patients with meal images) to demonstrate mLLM integration for glucose prediction. An mLLM (Pixtral Large) was employed to estimate macronutrients from meal images, providing automated meal analysis without manual food logging. We compare three distinct approaches: (1) Baseline using only glucose dynamics and basic insulin features, (2) LastMeal providing additional information about the last meal ingested by the patient, and (3) Bézier incorporating mechanistically modeled temporal features using optimized cubic Bézier curves to model temporal impacts of individual macronutrients on blood glucose. The modeled feature impacts served as input features for a LightGBM model. We also validate the mechanistic modeling component on the AZT1D dataset (24 patients with structured carbohydrate and correction insulin logs).
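A cubic Bézier curve of the kind used for the temporal macronutrient features can be evaluated as in the sketch below. The control points are hypothetical and are not the optimized values from the study; the curve is simply sampled in the (hours since meal, relative impact) plane.

```python
import numpy as np

def cubic_bezier(control_points, num=101):
    """Sample B(s) = (1-s)^3 P0 + 3(1-s)^2 s P1 + 3(1-s) s^2 P2 + s^3 P3 for s in [0, 1]."""
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    s = np.linspace(0.0, 1.0, num)[:, None]
    return (
        (1 - s) ** 3 * p0
        + 3 * (1 - s) ** 2 * s * p1
        + 3 * (1 - s) * s ** 2 * p2
        + s ** 3 * p3
    )

# Hypothetical control points: (hours after ingestion, relative glucose impact).
points = [(0.0, 0.0), (0.5, 1.0), (1.5, 1.0), (3.0, 0.0)]
curve = cubic_bezier(points)

# Locate the sampled point with the largest impact.
peak_index = int(np.argmax(curve[:, 1]))
peak_time, peak_impact = curve[peak_index]
```

In a setup like the one described, the four control points per macronutrient would be the quantities tuned during optimization, and the resulting impact values at each timestamp would be passed to the downstream model as features.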
Results:
The Bézier approach achieved the best performance across both datasets: D1NAMO RMSE of 15.06 at 30 min and 28.15 at 60 min; AZT1D RMSE of 16.61 at 30 min and 24.58 at 60 min. One-way ANOVA revealed statistically significant differences across prediction horizons of 45 to 120 min for the AZT1D dataset. Patient-specific Bézier curves revealed distinct metabolic response patterns: simple sugars peaked at 0.74 h, complex sugars at 3.07 h, and proteins at 4.36 h post-ingestion. Feature importance analysis showed temporal evolution from glucose change dominance to macronutrient prominence at longer horizons. Patient-specific modeling uncovered individual metabolic signatures with varying nutritional sensitivity and circadian influences.
Conclusion:
This study demonstrates the potential of combining mLLMs with mechanistic modeling for personalized diabetes management. The optimized Bézier curve approach provides superior temporal mapping while patient-specific models reveal individual metabolic signatures essential for personalized care.
Title: Multimodal large language models and mechanistic modeling for glucose forecasting in type 1 diabetes patients. Journal of Biomedical Informatics, volume 172, Article 104945.
Pub Date: 2025-12-01, Epub Date: 2025-11-25, DOI: 10.1016/j.jbi.2025.104964
Yong Li, Jianping Man, Yi Zhou, Likeng Liang
Objective
Medical Visual Question Answering (VQA) is a quintessential application scenario of biomedical Multimodal Large Language Models (MLLMs). Previous studies mainly focused on input image-question pairs, neglecting the rich medical knowledge of the relevant captions of the pretrained datasets. This limits the model’s reasoning capability and causes overfitting. This paper aims to effectively utilize the captions of pretrained datasets to solve the above issues.
Methods
This paper proposes a Caption-Augmented Reasoning Model (CARM), which introduces three innovative components to leverage the captions during finetuning: (1) a Cross-Modal Visual Augmentation (CMVA) module that enriches image feature representations through semantic alignment with retrieved captions; (2) a Retrieval Cross-Modal Attention (RCMA) mechanism that establishes explicit connections between visual features and domain-specific medical knowledge; and (3) a Hierarchical Rank Low-Rank Adaptation (HR-LoRA) module that optimizes parameter-efficient finetuning through rank-adaptive decomposition in both unimodal encoders and multimodal fusion layers.
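The low-rank adaptation idea underlying HR-LoRA can be sketched with standard LoRA, where a frozen weight matrix W is augmented by a trainable product scaled by alpha / rank. How HR-LoRA assigns ranks hierarchically is not described in the abstract, so the per-module ranks below (a smaller rank for an encoder layer, a larger one for a fusion layer) are purely hypothetical.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: W_eff = W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
x = rng.standard_normal((3, 64))

# Hypothetical rank assignment: different ranks for different parts of the model.
encoder_layer = LoRALinear(w, rank=4)
fusion_layer = LoRALinear(w, rank=16)

out = encoder_layer.forward(x)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only the small A and B matrices are updated during finetuning.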
Results
The proposed CARM achieved state-of-the-art performance across three benchmark datasets, with accuracy scores of 0.798 on VQA-RAD, 0.867 on VQA-SLAKE, and 0.718 on VQA-Med-2019, respectively, outperforming existing medical VQA models. Qualitative evaluations revealed that our caption-based augmentation effectively directed model attention to the image regions related to a question.
Conclusions
The proposed CARM effectively improves visual grounding and reasoning accuracy with the systematic integration of medical captions, and the HR-LoRA alleviates overfitting and improves training efficiency.
Title: Caption-augmented reasoning model with Hierarchical rank LoRA finetuing for medical visual question Answering. Journal of Biomedical Informatics, volume 172, Article 104964.
Pub Date: 2025-12-01, Epub Date: 2025-11-15, DOI: 10.1016/j.jbi.2025.104955
Alireza Moayedikia, Sara Fin, Uffe Kock Wiil
Objective:
Clinical trial recruitment faces critical challenges with screen failure rates exceeding 80% in Alzheimer’s disease (AD) trials. Traditional patient selection relies on expert consensus without systematic evaluation of trade-offs between statistical power, recruitment feasibility, safety, and cost. We developed a multi-objective optimization framework to systematically identify optimal eligibility criteria configurations that balance competing objectives in AD clinical trial design.
Methods:
We implemented the Non-dominated Sorting Genetic Algorithm III (NSGA-III) to optimize patient selection criteria across three objectives: patient identification accuracy (F1 score), recruitment balance, and economic efficiency. The framework utilized National Alzheimer’s Coordinating Center data comprising 2,743 participants with comprehensive clinical assessments and cerebrospinal fluid biomarker measurements. We optimized 14 eligibility parameters including age boundaries, cognitive thresholds, biomarker criteria, and comorbidity management policies. Statistical validation employed Monte Carlo simulation with 10,000 iterations, bootstrap analysis, and SHAP interpretability analysis.
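The core notion behind NSGA-family algorithms, Pareto dominance, can be sketched as follows. This is only the non-dominated filtering step, not NSGA-III itself (which adds reference-point-based selection and genetic operators), and the candidate objective values are hypothetical, with all three objectives treated as "higher is better".

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates: (F1 score, recruitment balance, economic efficiency).
candidates = [
    (0.990, 0.5, 0.6),
    (0.980, 0.7, 0.5),
    (0.970, 0.4, 0.4),  # worse than the first candidate on every objective
    (0.995, 0.3, 0.7),
]
front = pareto_front(candidates)
```

Each configuration left on the front represents a different trade-off, so choosing among them remains a judgment call for the trial designers.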
Results:
Optimization identified 11 Pareto-optimal solutions spanning F1 scores from 0.979 to 0.995 and eligible patient pools from 108 to 327. Compared to standard criteria selecting 101 participants, optimized approaches identified 102 participants with no significant demographic or clinical differences after multiple comparison correction. Monte Carlo simulation revealed mean cost savings of $1,048 per patient (95% CI: -$1,251 to $3,492), with 80.7% probability of positive savings but 19.3% risk of cost increases (SD = $1,208). Cross-validation demonstrated high precision (95.1%) with strategic selectivity (9.4% recall). SHAP analysis identified biomarker requirements as the dominant cost driver. Optimization algorithms converged toward solutions similar to expert-designed criteria, validating both computational and clinical approaches.
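As a rough sanity check on the reported cost figures, the probability of positive savings can be approximated from the mean and standard deviation alone. The normal distribution used here is an assumption; the abstract does not state the shape of the simulated distribution, and the reported interval is not exactly mean ± 1.96 SD, which suggests the simulated outcomes are not perfectly normal.

```python
import math

mean_savings = 1048.0  # reported mean cost savings per patient (USD)
sd_savings = 1208.0    # reported standard deviation (USD)

def normal_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumption: savings ~ Normal(mean, sd). Then P(savings > 0) = Phi(mean / sd).
p_positive = normal_cdf(mean_savings / sd_savings)
p_increase = 1.0 - p_positive

# Symmetric 95% interval under the same normal assumption, for comparison only.
normal_ci = (mean_savings - 1.96 * sd_savings, mean_savings + 1.96 * sd_savings)
```

Under this approximation the probability of positive savings comes out close to the reported 80.7%, while the symmetric interval differs somewhat from the reported one.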
Conclusion:
Multi-objective optimization provides meaningful but incremental value through systematic validation and probabilistic efficiency enhancement rather than revolutionary transformation. The convergence toward established practice demonstrates that computational approaches serve as sophisticated validation tools that identify concrete yet uncertain efficiency improvements within existing frameworks. The substantial variability in projected outcomes establishes realistic expectations and highlights the importance of site-specific evaluation, particularly regarding recruitment infrastructure quality as the dominant determinant of success. This establishes a mature paradigm for evidence-based trial design optimization that enhances rather than replaces clinical expertise.
Title: Multi-objective optimization formulation for Alzheimer’s disease trial patient selection. Journal of Biomedical Informatics, volume 172, Article 104955.
Pub Date: 2025-11-01; Epub Date: 2025-10-16; DOI: 10.1016/j.jbi.2025.104933
Jianbin Tan , Yan Zhang , Chuan Hong , T. Tony Cai , Tianxi Cai , Anru R. Zhang
Objectives:
We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.
Materials and Methods:
We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
Results:
Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.
Discussion:
The proposed method effectively addresses critical missingness patterns that arise in the integrated analysis of EHR datasets, enhancing the robustness and generalizability of clinical predictions.
Conclusions:
We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Title: Integrated analysis for electronic health records with structured and sporadic missingness. Journal of Biomedical Informatics, volume 171, Article 104933.
Pub Date: 2025-11-01; Epub Date: 2025-10-17; DOI: 10.1016/j.jbi.2025.104940
Min Tang , Yuhao Zhang , Ronghua Liang , Guoqiang Deng
Objective:
In medical environments, patient records are stored as heterogeneous features across various institutions, and legal or institutional constraints prohibit sharing the raw data. This fragmentation presents challenges for Online Medical Pre-Diagnosis (OMPD) systems. Existing methods (such as federated learning) require multiple rounds of interaction among all participating parties (hospitals and cloud servers), resulting in frequent communication. Moreover, because they share global gradients, they are vulnerable to inference attacks that leak information. In this paper, we propose a secure and efficient OMPD system framework to address the problem of vertical data fragmentation, aiming to resolve the contradiction between medical data isolation and model collaboration.
Methods:
We propose PPNLR, a secure framework for building OMPD systems. This framework combines functional encryption and blinding factors to design a sample-feature dimension encryption algorithm and a privacy-preserving vectorization training algorithm. Decoupling sample computation from model training enables cross-client data aggregation with only a single communication between hospitals and cloud servers.
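The role of blinding factors in aggregating vertically partitioned data can be illustrated with a minimal sketch. This is not the PPNLR algorithm: functional encryption is omitted entirely, and the names `zero_sum_masks` and `partial_score`, the modulus, the integer-encoded features, and the assumption that the hospitals obtain zero-sum masks from some trusted setup are all hypothetical. The sketch only shows that each hospital can send one blinded partial result and the server still recovers the correct aggregate.

```python
import secrets

# Hypothetical modulus; all arithmetic is done modulo this value.
MODULUS = 2**61 - 1


def zero_sum_masks(n):
    """Return n random masks whose sum is 0 modulo MODULUS (assumed trusted setup)."""
    masks = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    masks.append((-sum(masks)) % MODULUS)
    return masks


def partial_score(features, weights):
    """Inner product over the feature columns held by a single hospital."""
    return sum(f * w for f, w in zip(features, weights))


# One patient whose features are split across three hospitals
# (integer-encoded; real-valued data would need a fixed-point encoding).
hospital_features = [[3, 1], [4], [2, 5]]
hospital_weights = [[2, 7], [1], [3, 1]]

masks = zero_sum_masks(len(hospital_features))

# Each hospital sends a single blinded value to the server.
blinded = [
    (partial_score(f, w) + m) % MODULUS
    for f, w, m in zip(hospital_features, hospital_weights, masks)
]

# The masks cancel in the sum, so the server learns only the aggregate score.
aggregate = sum(blinded) % MODULUS
print(aggregate)
```

Each blinded value on its own is uniformly distributed modulo `MODULUS`, so in this toy setting the server learns nothing about an individual hospital's partial score beyond what the aggregate reveals.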
Results:
Security analysis shows that PPNLR is resistant to semi-honest inference attacks and collusion attacks. Evaluation results based on six real-world medical datasets (text and images) show that: (i) the inference accuracy is close to that of the centralized plaintext training benchmark; (ii) the computational efficiency is at least 3.6× higher than that of comparable approaches; (iii) the communication complexity is significantly reduced by eliminating dependencies on iteration count.
Conclusion:
PPNLR achieves data protection through cryptographic primitives, maintaining high diagnostic accuracy while ensuring the security of medical data and model parameters. Its single-communication architecture significantly reduces the deployment threshold in resource-constrained scenarios, providing a practical framework for building privacy-friendly OMPD systems.
Title: A non-interactive Online Medical Pre-Diagnosis system on encrypted vertically partitioned data. Journal of Biomedical Informatics, volume 171, Article 104940.
Pub Date: 2025-11-01; Epub Date: 2025-09-13; DOI: 10.1016/j.jbi.2025.104903
Hadasa Kaufman , Nadav Rappoport , Amir Gilad , Michal Linial
Causal inference from observational medical record data is critical for advancing precision and personalization in healthcare. Recently, biobanks – collections of biological samples linked with genetic, lifestyle, environmental, and health-related data – have emerged as valuable resources for large-scale population studies. By integrating these resources, biobanks offer a harmonized repository of diverse data for each individual, capturing real-world medical events, including procedures, treatments, and diagnoses. However, these resources are often affected by confounding factors, selection biases, and missing information, posing significant challenges to drawing valid causal conclusions. While randomized controlled trials (RCTs) remain the gold standard for drug development and medical decision-making, the growing availability of observational data highlights the need for robust causal inference methodologies. This study provides an overview of methods for inferring the effect of a treatment on an outcome from observational data applicable to biobank data, focusing on the unique challenges they address. Our objective is to introduce current methods used for causal discovery in observational medical data. We discuss classic and modern methodologies that offer significant opportunities alongside the difficulty in reaching causality. We cover statistical methods designed for large-scale biobanks that have the potential to improve clinical decision-making, guide public health policies, and drive further research.
Title: Advancing causal inference in medicine using biobank data. Journal of Biomedical Informatics, volume 171, Article 104903.
Pub Date: 2025-11-01; Epub Date: 2025-10-01; DOI: 10.1016/j.jbi.2025.104920
Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün
Objective:
Record linkage is essential for integrating data from multiple sources, with diverse applications in real-world healthcare and research. Probabilistic Privacy-Preserving Record Linkage (PPRL) enables this integration while protecting sensitive information from unauthorized access, especially when datasets lack exact identifiers. As privacy regulations evolve and multi-institutional collaborations expand globally, there is a growing demand for methods that effectively balance security, accuracy, and efficiency. However, ensuring both privacy and scalability in large-scale record linkage remains a key challenge.
Method:
This paper presents a novel and efficient PPRL method based on a secure 3-party computation (MPC) framework. Our approach allows multiple parties to compute linkage results without exposing their private inputs and significantly improves the speed of the linkage process compared to existing PPRL solutions.
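The idea of computing on private inputs among three parties can be illustrated with additive secret sharing, a standard MPC building block. The abstract does not say which sharing scheme or protocol the authors use, so this is a generic sketch under assumptions: the ring size `MODULUS` and the helper names `share`, `add_shared`, and `reconstruct` are hypothetical.

```python
import secrets

# Hypothetical ring size for the shares.
MODULUS = 2**64


def share(value, parties=3):
    """Split value into additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; no communication is needed."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]


def reconstruct(shares):
    """Recombine all shares to reveal the underlying value."""
    return sum(shares) % MODULUS


# Two private inputs are shared among three parties and added without
# any single party ever seeing either input in the clear.
x_shares = share(37)
y_shares = share(5)
total = reconstruct(add_shared(x_shares, y_shares))
print(total)
```

Addition is free in this representation because it is purely local. Comparisons and multiplications, which a linkage computation would also need, require interaction between the parties, which is why network bandwidth and latency affect running time.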
Result:
Our method preserves the linkage quality of a state-of-the-art (SOTA) MPC-based PPRL method while achieving up to 14 times faster performance. For example, linking a record against a database of 10,000 records takes just 8.74 s in a realistic network with 700 Mbps bandwidth and 60 ms latency, compared to 92.32 s with the SOTA method. Even on a slower internet connection with 100 Mbps bandwidth and 60 ms latency, the linkage completes in 28 s, whereas the SOTA method requires 287.96 s. These results demonstrate the significant scalability and efficiency improvements of our approach.
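As a quick arithmetic check, the speedups implied by the two quoted timing pairs can be computed directly. The dictionary layout and the `speedup` helper are illustrative only; the numbers are those given in the abstract. Both examples come out at roughly 10 times, so the "up to 14 times" figure presumably refers to a configuration not detailed here.

```python
# Timings in seconds, as quoted in the abstract, for linking one record
# against a database of 10,000 records.
timings = {
    "700 Mbps, 60 ms": {"sota": 92.32, "proposed": 8.74},
    "100 Mbps, 60 ms": {"sota": 287.96, "proposed": 28.0},
}


def speedup(sota_seconds, proposed_seconds):
    """Ratio of the baseline running time to the proposed method's running time."""
    return sota_seconds / proposed_seconds


speedups = {
    name: speedup(t["sota"], t["proposed"]) for name, t in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
```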
Conclusion:
Our novel PPRL method, based on secure 3-party computation, offers an efficient and scalable solution for large-scale record linkage while ensuring privacy protection. The approach demonstrates significant performance improvements, making it a promising tool for secure data integration in privacy-sensitive sectors.
Title: Accelerating probabilistic privacy-preserving medical record linkage: A three-party MPC approach. Journal of Biomedical Informatics, volume 171, Article 104920.