Bond-Aware Molecular Graph Learning With Multi-Graph Interleaved Message Passing
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3668790
Honghao Wang, Hongrui Zhang, Acong Zhang, Junlei Tang, Ping Li
Graph neural networks (GNNs) have demonstrated remarkable capabilities in molecular property prediction. Existing approaches adopt GNNs by modeling molecules as homogeneous graphs. However, the bonds between atoms can be heterogeneous, and their characterization and role in molecular graph representation learning remain unexplored. To address the heterogeneity inherent in molecular graphs, we build bond-centric graphs and propose a novel multi-graph learning model that captures bond heterogeneity via an augmented bond-graph view and bond coding for atom features. Unlike conventional multi-view learning, which focuses on late-stage view fusion, our method integrates cross-graph information during the node representation learning phase. To this end, we introduce the interleaved message passing graph neural network (IMPGNN), which allows messages to pass across three views of the molecular graph. Moreover, we introduce a novel structure-aware pooling mechanism for graph representation, which yields up to 45.7% gains over simple sum pooling. Comparative experiments on two standard molecular property prediction tasks reveal that our method surpasses all competing approaches (including multimodal models) on 75% of the evaluated benchmark datasets.
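One standard way to realize a bond-centric graph is the line-graph construction: each bond becomes a node, and two bond-nodes are connected exactly when the underlying bonds share an atom. The sketch below illustrates this with networkx on a made-up three-bond molecule; it is a minimal reading of "bond-centric graph", not the paper's code.

```python
import networkx as nx

# Toy molecular graph: atoms as nodes, bonds as edges with a bond-type attribute.
mol = nx.Graph()
mol.add_edge("C1", "C2", bond="double")
mol.add_edge("C2", "O1", bond="single")
mol.add_edge("C2", "N1", bond="single")

# Line graph: every bond becomes a node; two bond-nodes are adjacent
# exactly when the underlying bonds share an atom.
bond_graph = nx.line_graph(mol)

# Carry the heterogeneous bond types over as node features of the bond graph,
# so message passing on this view sees bond heterogeneity directly.
for u, v in bond_graph.nodes:
    bond_graph.nodes[(u, v)]["type"] = mol.edges[u, v]["bond"]

print(list(bond_graph.nodes(data=True)))
```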
Nonparametric Dynamic Granger Causality based on Multi-Space Spectrum Fusion for Time-varying Directed Brain Network Construction
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3670140
Chanlin Yi, Jiamin Zhang, Zihan Weng, Wanjun Chen, Dezhong Yao, Fali Li, Zehong Cao, Peiyang Li, Peng Xu
Nonparametric estimation of time-varying directed networks can unveil the intricate transient organization of directed brain communication while circumventing the constraints imposed by prescribed model-driven methods. A robust time-frequency representation, the foundation of the causality inference, is critical for reliability. This study proposes nonparametric dynamic Granger causality based on Multi-space Spectrum Fusion (ndGCMSF), which integrates complementary spectral information from different spaces into enhanced spectral representations for estimating dynamic causality across brain regions. Systematic simulations and validations demonstrate that ndGCMSF exhibits superior noise resistance and a powerful ability to capture subtle dynamic changes in directed brain networks. In particular, ndGCMSF revealed that during motor imagery, laterality in the hemisphere ipsilateral to the hemiplegic limb emerges at task onset and diminishes upon task completion. These intrinsic variations further provide features for assessing motor function. ndGCMSF offers a powerful means of deriving effective brain networks in dynamically changing operational settings and contributes to a broad range of problems involving dynamic, directed communication.
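As a point of reference for "time-varying directed" estimation, the simplest dynamic Granger scheme slides a window over the signals and fits restricted vs. full autoregressive models in each window. The sketch below is that parametric least-squares baseline, not the paper's nonparametric spectral-fusion method; window length, model order, and the toy data are illustrative.

```python
import numpy as np

def granger_xy(x, y, p=2):
    """Geweke-style Granger measure: log residual-variance ratio of the
    restricted (y's own past) vs. full (y's and x's past) AR(p) models."""
    n = len(y)
    t = y[p:]
    lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    variances = []
    for X in (lags_y, np.hstack([lags_y, lags_x])):
        X1 = np.hstack([X, np.ones((len(X), 1))])        # add an intercept column
        beta, *_ = np.linalg.lstsq(X1, t, rcond=None)
        variances.append(np.var(t - X1 @ beta))
    return np.log(variances[0] / variances[1])           # > 0 suggests x -> y

def dynamic_granger(x, y, win=200, step=50, p=2):
    """Sliding-window estimate of a time-varying directed influence x -> y."""
    return np.array([granger_xy(x[s:s + win], y[s:s + win], p)
                     for s in range(0, len(x) - win + 1, step)])

# Toy check: y is driven by lagged x, so the x -> y measure stays positive.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = 0.6 * np.roll(x, 1) + 0.3 * rng.standard_normal(2000)
print(dynamic_granger(x, y).round(3))
```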
Reprogramming Automatic Speech Recognition Models for Neonatal Chest Sound Separation
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669531
Yang Yi Poh, Ethan Grooby, Kenneth Tan, Atul Malhotra, Mehrtash Harandi, Faezeh Marzbanrad
Stethoscope-recorded chest sounds are invaluable for non-invasive, real-time assessment of heart and lung sounds. However, noisy recordings can undermine algorithms that rely on clean chest sounds, making preprocessing necessary to isolate the desired sources and remove noise, interference, and artifacts. This paper is the first to explore reprogramming automatic speech recognition (ASR) models for neonatal chest sound separation. In particular, we reprogrammed the Whisper ASR model, proposing two approaches: reprogramming only Whisper's audio encoder and reprogramming the full Whisper model. Using only simple linear layers and learnable parameters, we show that this parameter-efficient reprogramming of Whisper effectively separates heart and lung sounds from noise on an artificial dataset. We also demonstrate the effectiveness of the proposed method as a preprocessing step for various heart and lung sound algorithms, yielding results comparable to state-of-the-art performance. Applying a pre-trained ASR model to sound separation demonstrates that frozen foundation models can be efficiently reprogrammed across domains and applied to biomedical data.
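The reprogramming pattern itself is compact: trainable linear maps wrap a frozen pretrained encoder, and only the wrappers learn. The sketch below shows that pattern with a stand-in encoder (the paper uses Whisper's audio encoder); the mask-based separation head, source count, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReprogrammedSeparator(nn.Module):
    """Model reprogramming sketch: learnable input/output linear layers around a
    frozen encoder (a stand-in here for Whisper's audio encoder)."""

    def __init__(self, frozen_encoder, in_dim=80, enc_dim=512, n_sources=3):
        super().__init__()
        self.input_map = nn.Linear(in_dim, in_dim)       # learnable input reprogramming
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():              # encoder stays frozen
            p.requires_grad = False
        # One spectrogram mask per source (e.g. heart, lung, noise).
        self.output_map = nn.Linear(enc_dim, in_dim * n_sources)
        self.n_sources, self.in_dim = n_sources, in_dim

    def forward(self, spec):                             # spec: (batch, time, mel_bins)
        z = self.encoder(self.input_map(spec))           # (batch, time, enc_dim)
        masks = self.output_map(z).view(*spec.shape[:2], self.n_sources, self.in_dim)
        masks = masks.softmax(dim=2)                     # masks sum to 1 across sources
        return masks * spec.unsqueeze(2)                 # per-source spectrograms

# Stand-in for the frozen pretrained encoder.
dummy_encoder = nn.Sequential(nn.Linear(80, 512), nn.GELU(), nn.Linear(512, 512))
model = ReprogrammedSeparator(dummy_encoder)
print(model(torch.randn(2, 100, 80)).shape)              # torch.Size([2, 100, 3, 80])
```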
GPFD-Net: A Geometry-Pose Frequency Decoupling Network for Privacy-Preserving Human Action Recognition in Healthcare
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669251
Xing Li, Jingfan Liang, Ge Gao, Li Wang, Haifeng Wang, Shihao Han
Human Action Recognition (HAR) holds significant application value in healthcare informatics, facilitating tasks such as clinical diagnosis and rehabilitation monitoring. Point cloud sequences have emerged as a pivotal modality for balancing privacy preservation with high-fidelity geometric representation, ensuring anonymity while retaining critical 3D behavioral information. However, existing point cloud sequence encoding methods struggle to precisely encode micro-geometric details and macro-pose contours in the spatial dimension, as well as the dynamic heterogeneity of actions in the temporal dimension. These limitations impede high-precision clinical motion analysis. To address these challenges, we propose a Geometry-Pose Frequency Decoupling Network (GPFD-Net) for human action recognition. First, we design a Geometry-Pose Parallel-Collaborative Spatial Encoder (GPCSE), which uses a parallel dual-stream architecture to explicitly capture and fuse complementary micro-geometric details and macro-pose contours, generating an informative geometry-enhanced pose feature sequence. Second, we introduce a Frequency-Decoupled Temporal Capturer (FDTC), which adaptively decomposes the geometry-enhanced pose feature sequence into a smooth trend sequence and a transient detail sequence; these are then processed by two parallel expert encoders via differentiated encoding to achieve robust recognition. Extensive experiments on four public benchmark datasets demonstrate that GPFD-Net achieves superior performance. The proposed method provides a novel paradigm for high-precision, privacy-preserving motion analysis in healthcare applications.
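The abstract calls the FDTC decomposition adaptive; a common minimal realization of a trend/detail split is a moving average for the smooth trend and the residual as the transient detail (as in Autoformer-style series decomposition). The sketch below uses that fixed variant purely for illustration.

```python
import torch
import torch.nn.functional as F

def decompose(seq, kernel=25):
    """Split a feature sequence into a smooth trend and a transient detail part
    via moving average -- a simple stand-in for an adaptive decomposition.
    seq: (batch, time, channels)."""
    x = seq.transpose(1, 2)                              # (batch, channels, time)
    pad = kernel // 2
    trend = F.avg_pool1d(F.pad(x, (pad, pad), mode="replicate"),
                         kernel_size=kernel, stride=1).transpose(1, 2)
    return trend, seq - trend                            # trend, high-frequency detail

seq = torch.randn(4, 120, 64)                            # geometry-enhanced pose features
trend, detail = decompose(seq)
print(trend.shape, detail.shape)                         # both torch.Size([4, 120, 64])
# Each part would then feed its own expert temporal encoder.
```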
DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669866
Minxi Ouyang, Lianghui Zhu, Yaqing Bao, Qiang Huang, Jingli Ouyang, Tian Guan, Xitong Ling, Jiawen Li, Song Duan, Wenbin Dai, Li Zheng, Xuemei Zhang, Yonghong He
Multimodal large models have shown great potential for automating pathology image analysis. However, current multimodal models for gastrointestinal pathology are constrained by both data quality and reasoning transparency: pervasive noise and incomplete annotations in public datasets predispose vision-language models to factual hallucinations when generating diagnostic text, while the absence of explicit intermediate reasoning chains renders the outputs difficult to audit and thus less trustworthy in clinical practice. To address these issues, we construct a large-scale gastrointestinal pathology dataset containing both microscopic descriptions and diagnostic conclusions, and propose a prompt augmentation strategy that incorporates lesion classification and anatomical site information. This design guides the model to better capture image-specific features and maintain semantic consistency in generation. Furthermore, we employ a post-training pipeline that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve reasoning quality and output structure. Experiments on real-world pathology report generation tasks demonstrate that our approach significantly outperforms state-of-the-art open-source and proprietary baselines in generation quality, structural completeness, and clinical relevance, achieving 18.7% higher clinical relevance, 32.4% better structural completeness, and 41.2% fewer diagnostic errors than existing solutions.
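GRPO's defining step is critic-free advantage estimation: several responses are sampled per prompt, and each response's reward is normalized by its own group's mean and standard deviation. A minimal sketch of that step (reward values are made up):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled response's reward by
    its group's mean and std, so no learned value network is needed.
    rewards: (n_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled reports each, scalar rewards from e.g. a
# structure/relevance scorer (illustrative numbers).
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4],
                        [0.1, 0.1, 0.8, 0.2]])
print(grpo_advantages(rewards))
```

These advantages then weight a clipped, PPO-style policy-gradient objective over the generated tokens.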
MuST: Multi-Scale Transformer Incorporating Hierarchical Attention and TCN for EEG Decoding
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669898
Kui Zhao, Enze Shi, Di Zhu, Sigang Yu, Geng Chen, Shijie Zhao, Dingwen Zhang, Shu Zhang
Electroencephalography (EEG) signals exhibit significant inherent time-scale differences across individuals and tasks. Despite notable successes in decoding EEG signals in single tasks (e.g., epilepsy detection), where time scales are relatively consistent, the substantial differences in temporal characteristics across tasks pose a significant challenge. To address these limitations, we propose MuST, a Multi-Scale Transformer that dynamically learns the characteristics of EEG signals at different time scales. Building on the conventional Convolutional Neural Network (CNN)-Transformer model, MuST introduces two innovations: (1) a hierarchical Transformer structure that dynamically captures global dependencies and long-range information from EEG signals at different scales, and (2) a novel temporal convolutional network (TCN) module that replaces the original feed-forward network (FFN) in the Transformer, effectively capturing local temporal patterns and short-term dependencies. To validate MuST, we conducted experiments on five public EEG datasets with extreme time-scale differences, achieving an average classification accuracy of 91.69% under identical parameter settings. This surpasses the EEGNet baseline by 5.65%, highlighting MuST's superior capability in handling multi-scale EEG signals across diverse tasks. More critically, MuST successfully provides unified modeling of EEG temporal heterogeneity through mixed-dataset training (epilepsy detection and sleep staging), validating the multi-scale architecture's capability to dynamically reconcile divergent neurophysiological timescales within a single model. Our code can be found at https://github.com/wisercc/MuST.
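The FFN-to-TCN swap can be read as a pre-norm Transformer encoder layer whose position-wise feed-forward block is replaced by a small dilated causal convolution stack. The sketch below follows that reading; layer sizes, dilation, and the pre-norm choice are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    """Dilated causal conv stack standing in for the Transformer FFN."""
    def __init__(self, dim, kernel=3, dilation=2):
        super().__init__()
        self.pad = (kernel - 1) * dilation               # left padding keeps it causal
        self.conv1 = nn.Conv1d(dim, dim, kernel, dilation=dilation)
        self.conv2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                                # x: (batch, time, dim)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))      # pad only on the left
        return self.conv2(F.gelu(self.conv1(h))).transpose(1, 2)

class AttnTCNLayer(nn.Module):
    """Encoder layer: self-attention + TCN in place of the usual FFN."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tcn = TCNBlock(dim)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        q = self.n1(x)
        x = x + self.attn(q, q, q)[0]
        return x + self.tcn(self.n2(x))

x = torch.randn(8, 256, 64)                              # (batch, EEG time steps, dim)
print(AttnTCNLayer()(x).shape)                           # torch.Size([8, 256, 64])
```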
Mamba-Based Prototypical Contrastive Learning With Augmented Feature Separation for Common and Rare Arrhythmia Classification
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669893
Fengyi Guo, Ying An, Jianxin Wang
Early diagnosis of arrhythmia, a common cardiovascular condition, is crucial for improving prognosis. The electrocardiogram (ECG) is widely used as a non-invasive diagnostic tool. However, computer-aided diagnosis of rare arrhythmias faces significant challenges due to the severe scarcity of samples for rare disease classes. To tackle this, we propose a Mamba-based Prototypical Contrastive Learning framework that can simultaneously identify both common and rare classes under the generalized Few-Shot Learning (FSL) setting. It primarily consists of: (1) the Mamba-based Spatio-Temporal Feature Fusion Network (MST), which integrates spatial features from multi-scale convolutions and temporal dynamics from bidirectional Mamba for ECG modeling; and (2) the Prototypical Contrastive Learning framework with Augmented Feature Separation (PCAS), which employs a prototype augmentation strategy with an Augmented Prototype Consistency Loss to optimize prototype representations and a Separation-Tuned Contrastive Loss to enhance intra-class compactness and inter-class distinctness, mitigating the risk of class collapse. Extensive experiments on the publicly available PTBXL and Chapman datasets demonstrate the effectiveness of MST-PCAS, which achieves superior rare-class recognition accuracies of 79.13% and 50.72%, respectively, for ECG arrhythmia classification.
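At the core of any prototypical scheme, each class is summarized by the mean of its support embeddings, and queries are scored against the prototypes. The sketch below shows only that core (cosine-similarity logits plus cross-entropy); the paper's prototype augmentation and separation-tuned contrastive losses sit on top of it, and all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def prototype_logits(support, labels, queries, tau=0.1):
    """Mean-of-support prototypes scored against queries by cosine similarity.
    support: (n_support, d), labels: (n_support,), queries: (n_query, d)."""
    classes = labels.unique()                            # sorted class ids
    protos = torch.stack([support[labels == c].mean(0) for c in classes])
    protos = F.normalize(protos, dim=1)
    queries = F.normalize(queries, dim=1)
    return queries @ protos.T / tau                      # (n_query, n_class)

# Toy usage: 16-dim ECG embeddings, 3 arrhythmia classes.
emb = torch.randn(30, 16)
lab = torch.arange(30) % 3                               # guarantees all 3 classes
q = torch.randn(5, 16)
logits = prototype_logits(emb, lab, q)
loss = F.cross_entropy(logits, torch.tensor([0, 1, 2, 0, 1]))
print(logits.shape, loss.item())                         # torch.Size([5, 3]) ...
```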
CoMIL: A Contrastive CNN-Transformer Framework with Multi-Instance Learning for Whole-Slide Pathology Image Classification
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669663
Bowen Liu, Hongbo Zhu, Xiaotong Wei, Chuan Lin, Wei Wang
Whole slide image (WSI) classification is challenging due to gigapixel scale and weak supervision, and methods often struggle to balance global context with local details. We propose CoMIL, a dual-branch framework based on symmetric mutual learning. First, to resolve the dilemma that single-stream networks struggle to capture global context and fine-grained details simultaneously, we employ dual parallel pathways: a Transformer branch models long-range instance dependencies, while a CNN branch captures localized tissue morphology. Second, to address spatial information loss, we design a Hyper Positional Generator (HyperPG) that integrates multi-scale adaptive mechanisms with deformable convolutions, enhancing spatial awareness with linear complexity. Finally, to improve robustness against weak label noise, bidirectional learning between the branches is achieved through KL divergence minimization. Extensive experiments show that the proposed method achieves an area under the curve of 98.6% and an accuracy of 95.3% on the Camelyon16 dataset, and an area under the curve of 98.8% and an accuracy of 93.3% on the TCGA_Kidney dataset, surpassing known state-of-the-art WSI classification methods.
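The bidirectional KL term is the deep-mutual-learning objective: each branch's prediction is pulled toward the other's, with the target side detached so each branch treats the other as a fixed teacher within the step. A minimal sketch of that loss (batch size and class count illustrative):

```python
import torch
import torch.nn.functional as F

def mutual_kl(logits_a, logits_b):
    """Symmetric KL mutual-learning loss between two branches' predictions.
    F.kl_div expects log-probabilities as input and probabilities as target;
    targets are detached so each branch learns from the other, not through it."""
    la, lb = F.log_softmax(logits_a, dim=1), F.log_softmax(logits_b, dim=1)
    pa, pb = la.exp().detach(), lb.exp().detach()
    return (F.kl_div(la, pb, reduction="batchmean")
            + F.kl_div(lb, pa, reduction="batchmean"))

# Toy usage: slide-level logits from the Transformer and CNN branches.
t_logits = torch.randn(4, 2, requires_grad=True)
c_logits = torch.randn(4, 2, requires_grad=True)
loss = mutual_kl(t_logits, c_logits)
loss.backward()
print(loss.item())
```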
Graph-Informed and FiLM-Enhanced Multimodal Fusion for Myocardial Infarction Prediction
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669222
Xiantong Xiang, Longxiao Gao, Yuansheng Liu, Ningji Gong, Yongshun Gong
Accurate and timely diagnosis of cardiovascular diseases, particularly myocardial infarction (MI), remains a critical clinical challenge. Existing electrocardiogram (ECG) analysis methods often rely solely on a single data modality, such as raw signals or waveform images, which limits their ability to capture the broader physiological context. To address this limitation, we propose GFM-MIP, a Graph-informed and FiLM-enhanced Multimodal Fusion framework for myocardial infarction prediction. GFM-MIP integrates 12-lead ECG time-series signals, ECG images, and laboratory test results through a unified architecture. Specifically, it employs a Graphormer encoder to model inter-lead dependencies in ECG signals and a Vision Transformer to extract morphological patterns from ECG images, both modulated by patient-specific laboratory features using Feature-wise Linear Modulation (FiLM). A Transformer-based fusion module captures cross-modal interactions, while a contrastive learning objective encourages alignment between signal and image modalities. Experimental results on a real-world clinical dataset and three public benchmarks demonstrate that GFM-MIP consistently outperforms state-of-the-art baselines across multiple evaluation metrics. Ablation studies further validate the contribution of each modality and architectural component. The proposed framework offers a clinically meaningful and scalable solution for robust, multimodal cardiovascular diagnosis.
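FiLM itself is a well-defined published mechanism: a conditioning network predicts a per-channel scale (gamma) and shift (beta) applied to the main features. The sketch below shows lab-test features modulating ECG token features; all dimensions are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector (here, patient lab
    tests) predicts a per-channel scale and shift for the main features."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):                      # feats: (B, T, D), cond: (B, C)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)

# Toy usage: 20 lab values modulate 64-dim ECG token features (one token per lead).
film = FiLM(cond_dim=20, feat_dim=64)
ecg_tokens = torch.randn(8, 12, 64)
labs = torch.randn(8, 20)
print(film(ecg_tokens, labs).shape)                      # torch.Size([8, 12, 64])
```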
Multi-level Asymmetric Contrastive Learning for Medical Image Segmentation Pre-training
Pub Date: 2026-03-03 | DOI: 10.1109/JBHI.2026.3669549
Shuang Zeng, Lei Zhu, Xinliang Zhang, Qian Chen, Hangzhou He, Lujia Jin, Zifeng Tian, Zhaoheng Xie, Micky C Nnamdi, Wenqi Shi, J Ben Tamo, May D Wang, Yanye Lu
Medical image segmentation is a fundamental yet challenging task due to the arduous process of acquiring large volumes of high-quality labeled data from experts. Contrastive learning offers a promising but still problematic solution to this dilemma. First, existing medical contrastive learning strategies focus on extracting image-level representations, ignoring the abundant multi-level representations. Furthermore, they underutilize the decoder, either initializing it randomly or pre-training it separately from the encoder, thereby neglecting the potential collaboration between the two. To address these issues, we propose MACL, a novel multi-level asymmetric contrastive learning framework for medical image segmentation. Specifically, we design an asymmetric contrastive learning structure that pre-trains the encoder and decoder simultaneously to provide better initialization for segmentation models. Moreover, we develop a multi-level contrastive learning strategy that integrates correspondences across feature-level, image-level, and pixel-level representations, ensuring the encoder and decoder capture comprehensive details from representations of varying scales and granularities during pre-training. Experiments on 8 medical image datasets show that MACL outperforms 11 existing contrastive learning strategies: it produces more precise predictions in visualizations and improves Dice by 1.72%, 7.87%, 2.49%, and 1.48% over the previous best results on ACDC, MMWHS, HVSMR, and CHAOS, respectively, with 10% labeled data. MACL also generalizes well across 5 U-Net variant backbones.
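A multi-level contrastive strategy reuses one objective at several granularities. The sketch below shows the standard InfoNCE loss once, at image level; applying it per feature map or per pixel changes only what counts as a "row". Embedding sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """InfoNCE between two views: row i of z1 and z2 form the positive pair,
    every other row serves as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                             # (N, N) similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.shape[0]))

# Toy usage: embeddings of two augmented views of the same batch of images.
v1, v2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(v1, v2).item())
```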