Pub Date: 2026-01-07 | DOI: 10.1016/j.inffus.2026.104128
Xingchi Chen, Fushen Xie, Fa Zhu, Shuanglong Zhang, Xiaoyang Lu, Qing Li, Rong Chen, Dazhou Li, David Camacho
Detecting epileptic seizures from multi-sensor EEG signals is challenging because the signals are inherently complex, sensor configurations vary, and inter-class differences are weak and hard to distinguish. To address these challenges, we propose a novel multimodal information fusion framework that integrates a large language model (LLM) with a multimodal EEG feature tokenization method for epilepsy detection. A multimodal feature extraction (MFE) method generates feature representations of the EEG signals from different signal domains. In addition, we design a multimodal EEG feature tokenization method that tokenizes EEG signal features and fuses them with the semantic information in the prompt, which addresses the problem of combining epileptic EEG features with prompt semantics. The reasoning and pattern recognition capabilities of pre-trained LLMs are then used to detect epileptic events accurately and robustly. The proposed method is evaluated on a public dataset, and extensive experiments show that it outperforms the compared methods on multiple performance metrics.
Title: Tokenized EEG signals with large language models for epilepsy detection via multimodal information fusion. Journal: Information Fusion, vol. 131, Article 104128.
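The abstract describes turning EEG features into tokens that can sit alongside prompt text. The following is only a minimal sketch of that general idea, not the authors' method: the three time-domain features, the uniform binning, the token format, and the prompt wording are all assumptions made for illustration.

```python
import math

def time_domain_features(signal):
    """Mean, standard deviation, and line length of one EEG channel."""
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    line_length = sum(abs(b - a) for a, b in zip(signal, signal[1:]))
    return {"mean": mean, "std": std, "line_length": line_length}

def tokenize(value, low, high, bins=8):
    """Map a scalar feature to a discrete token id in [0, bins - 1] (uniform bins)."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)

def build_prompt(features, ranges):
    """Place feature tokens next to a textual instruction (hypothetical format)."""
    tokens = [f"<{name}_{tokenize(v, *ranges[name])}>" for name, v in features.items()]
    return "EEG features: " + " ".join(tokens) + " Is this segment a seizure?"

features = time_domain_features([0.0, 1.0, 0.0, -1.0])
ranges = {"mean": (-1.0, 1.0), "std": (0.0, 1.0), "line_length": (0.0, 4.0)}
prompt = build_prompt(features, ranges)
```

A real system would more likely learn continuous token embeddings than use fixed bins; the discrete version is shown only because it is easy to inspect.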
Pub Date: 2026-01-07 | DOI: 10.1016/j.inffus.2026.104127
Menglin Yu, Shuxia Lu, Jiacheng Cong
Graph neural networks (GNNs) perform exceptionally well in node classification, but they face severe challenges when the classes are imbalanced. On the one hand, the model is prone to overfitting because minority classes have few samples. The message passing mechanism of GNNs amplifies this problem, causing the model to overfit specific features and local neighborhood structures of minority class nodes rather than learning general patterns, which results in poor generalization. On the other hand, the scarcity of samples leads to high variance in model training: performance depends heavily on the specific training samples and local graph structures and is extremely sensitive to data partitioning, which ultimately produces severe performance fluctuations and unstable results. To address minority class overfitting and high model variance in imbalanced scenarios, we propose a Similarity-Guided Dual-Graph Learning Framework (SG-DGLF). First, to counter overfitting on minority classes, the framework introduces a similarity-based random capture mechanism with a dynamic threshold, which supplements minority class samples by generating pseudo labels. Second, we use graph diffusion-based propagation and a random edge dropping strategy to create new graphs, increasing node diversity and thereby reducing model variance. Empirically, SG-DGLF significantly outperforms advanced baseline methods on multiple imbalanced datasets, which validates the effectiveness of the framework in mitigating minority class overfitting and high model variance.
Title: SG-DGLF: A similarity-guided dual-graph learning framework. Journal: Information Fusion, vol. 130, Article 104127.
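Random edge dropping, one of the two graph-generation strategies named in the abstract, can be sketched in a few lines. This is the generic technique under assumed conventions (an undirected edge list and an independent drop probability per edge); the paper's exact procedure is not given in the abstract.

```python
import random

def drop_edges(edges, drop_prob, seed=None):
    """Return a new edge list in which each edge is removed independently
    with probability drop_prob. The input list is left unchanged."""
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() >= drop_prob]

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
view_a = drop_edges(edges, drop_prob=0.3, seed=1)
view_b = drop_edges(edges, drop_prob=0.3, seed=2)
```

Calling the function with different seeds yields different perturbed views of the same graph, which is the sense in which edge dropping can add diversity during training.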
Pub Date: 2026-01-06 | DOI: 10.1016/j.inffus.2025.104116
Jun Lyu, Xunkang Zhao, Jing Qin, Chengyan Wang
Cardiac cine MRI is the clinical gold standard for dynamic cardiac assessment, but reducing k-space sampling to accelerate acquisition results in low-resolution images that fail to depict fine anatomical details. Existing super-resolution methods struggle to preserve spatial details and temporal coherence because they handle non-rigid cardiac deformations poorly and rely on lossy feature downsampling. This paper proposes a Wavelet-based Deformable Attention Super-Resolution Network (WDASR) that addresses these limitations through two key innovations. First, a Frequency Subband Adaptive Alignment (FSAA) module applies deformable convolution to wavelet-decomposed frequency subbands, enabling lossless downsampling that prevents offset over-shifting and allows targeted alignment across neighboring and remote frames. Second, a Cross-Resolution Wavelet Attention (CRWA) module uses temporally aggregated frequency subbands as low-resolution keys and values and the current frame as the high-resolution query, reducing computational complexity by 75% while integrating multi-scale spatiotemporal information for enhanced texture representation. A bidirectional recurrent mechanism further propagates the enhanced features to maintain temporal consistency. Experiments on public and private datasets demonstrate that WDASR achieves 4× super-resolution with state-of-the-art performance and potential for clinical application.
Title: WDASR: A wavelet-based deformable attention network for cardiac cine MRI super-resolution with spatiotemporal motion modeling. Journal: Information Fusion, vol. 130, Article 104116.
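The claim that wavelet decomposition gives lossless downsampling can be illustrated with a single-level 2D Haar transform: the image is split into four half-resolution subbands, and the original is recovered exactly from them. The sketch below assumes the Haar wavelet and a plain list-of-rows image; the abstract does not say which wavelet WDASR uses, and the subband names follow one common convention.

```python
def haar2d(image):
    """Single-level 2D Haar transform of an image with even height and width.
    Returns four subbands, each with half the height and half the width."""
    h, w = len(image), len(image[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i // 2][j // 2] = (a - b + c - d) / 2  # left-right difference
            hl[i // 2][j // 2] = (a + b - c - d) / 2  # top-bottom difference
            hh[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh

def inverse_haar2d(ll, lh, hl, hh):
    """Rebuild the full-resolution image from the four Haar subbands."""
    h, w = len(ll), len(ll[0])
    image = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return image

frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
subbands = haar2d(frame)
restored = inverse_haar2d(*subbands)
```

Each subband holds one quarter of the spatial positions of the input, yet the four together carry all of its information, unlike strided convolution or pooling, which discard values.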
Pub Date: 2026-01-05 | DOI: 10.1016/j.inffus.2026.104123
Yinan Li, Zhi Liu, Jiajun Tang, Binghong Chen, Mingjin Kuai, Jun Long, Zhan Yang
Hashing has been extensively applied in cross-modal retrieval by mapping data from diverse modalities into binary codes. Semantic transfer aims to enhance the relevance of heterogeneous representations by migrating valuable information from one modality to another in the unsupervised paradigm. Combining semantic transfer with hash learning replaces dense vector search with Hamming distance comparison, significantly reducing storage requirements and increasing retrieval efficiency. However, the current unsupervised mechanism achieves only modest retrieval precision, which semantic annotation could improve. In particular, a mediocre information fusion strategy directly degrades the quality of the learned hash codes. In this paper, we propose a novel Semantic Transfer framework for Semi-supervised Cross-modal Hashing, denoted as STSCH. First, we use multiple auto-encoders to learn the high-level semantic representation of each modality. To guarantee the completeness of heterogeneous data, we combine these representations via semantic transfer and analyse the feature distribution of the different modalities. Furthermore, we construct an asymmetric hash learning framework between the individual modality-specific representations and a small set of semantic labels. Finally, an effective optimization algorithm is proposed. Comprehensive experiments on the Wiki, MIRFlickr, and NUS-WIDE datasets demonstrate that STSCH outperforms state-of-the-art hashing approaches.
Title: Rethink: reveal the impact of semantic distribution transfer from the cross-modal hashing perspective. Journal: Information Fusion, vol. 130, Article 104123.
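The efficiency argument for hashing rests on comparing short binary codes with Hamming distance instead of searching dense vectors. A minimal sketch of that retrieval step follows; the sign-threshold binarization and the tiny 4-bit database are illustrative assumptions, not the STSCH learning procedure.

```python
def binarize(vector):
    """Sign-threshold a real-valued embedding into a bit-packed integer hash code."""
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")

def retrieve(query_code, database, top_k=2):
    """Return the ids of the top_k items closest to the query in Hamming distance."""
    ranked = sorted(database, key=lambda item_id: hamming(query_code, database[item_id]))
    return ranked[:top_k]

# Hypothetical 4-bit codes for three images, queried with a text embedding.
image_codes = {"img_a": 0b1010, "img_b": 0b1110, "img_c": 0b0101}
text_query = binarize([0.3, -1.2, 0.8, -0.1])
nearest = retrieve(text_query, image_codes)
```

Because each code fits in a machine word and the distance is an XOR followed by a bit count, storage and comparison costs are far lower than for floating-point embeddings.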
Pub Date: 2026-01-04 | DOI: 10.1016/j.inffus.2025.104114
Bharat Richhariya, M. Tanveer, Weiping Ding
In several applications, datasets have an underlying graph structure, and the learning algorithm needs geometric information about the data. Universum data is a useful resource for classification problems because it provides prior information about the data distribution. However, previous algorithms have not used the graph connectivity information embedded in the universum data. To address this problem, this work proposes a novel graph-based algorithm that incorporates the connectivity information of universum data into the optimization problem of the classifier. The proposed algorithm is termed the graph-based universum least squares twin support vector machine (GULSTSVM). It applies manifold regularization on the universum graph to provide geometric information to the classifier. Its solution reduces to a system of linear equations, which makes training efficient. Moreover, to capture both local and global connectivity information of universum data efficiently, this work also proposes a novel multi-hop connectivity method that fuses local and global graph connectivity. A minimum spanning tree captures local connectivity, and feature aggregation provides global connectivity information. Experimental results on synthetic and real-world benchmark datasets show the advantages and applicability of the proposed algorithm.
Title: GULSTSVM: A fusion of graph information and universum learning in twin SVM. Journal: Information Fusion, vol. 130, Article 104114.
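The abstract uses a minimum spanning tree to capture local connectivity among universum points. As a rough illustration, the sketch below builds such a tree with Kruskal's algorithm over pairwise Euclidean distances. The choice of Kruskal's algorithm, the Euclidean metric, and the sample points are assumptions; the abstract does not state how the tree is constructed.

```python
import math
from itertools import combinations

def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph over the points.
    Returns a list of (i, j, weight) edges forming the tree."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    candidate_edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for weight, i, j in candidate_edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree

universum_points = [(0, 0), (1, 0), (1, 1), (5, 5)]
tree = minimum_spanning_tree(universum_points)
```

The resulting tree links every point to the rest through its shortest available connections, giving a sparse graph with n - 1 edges on which a Laplacian for manifold regularization could then be defined.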
Pub Date: 2026-01-03 | DOI: 10.1016/j.inffus.2025.104100
Yachuan Wang, Bin Zhang, Hao Yuan
Real-world deployment of video-based 3D human pose estimation remains challenging, as limited annotated data collected in constrained lab settings cannot fully capture the complexity of human motion. While motion synthesis for data augmentation has emerged as a mainstream solution to enhance generalization, existing synthesis methods suffer from inherent trade-offs: kinematics-based approaches preserve anatomical plausibility but sacrifice temporal coherence, while coordinate-based methods ensure motion smoothness but violate biomechanical constraints. As a result, domain gaps persist when synthetic data is used directly in the observation space to train pose estimation models. To overcome this, we propose DAK-Pose, which shifts augmentation to the feature space. We disentangle motion into structural and dynamic features and design two complementary augmentors: (1) a structure-prioritized module that enforces kinematic constraints for anatomical validity, and (2) a dynamic-prioritized module that generates diverse temporal patterns. Auxiliary encoders trained on synthetic motions generated by these augmentors transfer domain-invariant knowledge to the pose estimator through adversarial alignment. Experiments on the Human3.6M, MPI-INF-3DHP, and 3DPW datasets show that DAK-Pose achieves state-of-the-art cross-dataset performance.
Title: DAK-Pose: Dual-augmentor knowledge fusion for generalizable video-based 3D human pose estimation. Journal: Information Fusion, vol. 130, Article 104100.
Pub Date: 2026-01-03 | DOI: 10.1016/j.inffus.2025.104115
Haoyu Wang, Taylor Yiu, Serena Lee, Ka Gao, Hangling Sun, Chenyu Zhou, Anji Li, Qiangqiang Fu, Yu Wang, Bin Chen
Robotic-assisted endovascular interventions promise to transform cardiovascular therapy by improving procedural precision and minimizing cardiologists’ exposure to occupational risks. However, current systems are limited by their reliance on manual control and lack of adaptability to complex vascular anatomies. To address these challenges, we propose a novel Hierarchical Autonomous Guidewire Navigation and Delivery (HAG-ND) framework that leverages the strengths of multimodal large language models (MLLMs) and a novel reinforcement learning module inspired by Deep Q-Networks (DQNs). The high-level MLLM is trained on diverse blood vessel and guidewire scenarios from various angles and positions, enabling it to assess the suitability and timing of substance release at the target location. Within the MLLM, a parliamentary mechanism is introduced, where multiple specialized models, each focusing on a specific aspect of the vascular environment, vote on the optimal course of action. The low-level reinforcement learning module focuses on optimizing autonomous guidewire navigation to the designated target site by learning from the rich semantic understanding provided by the MLLM. Experimental evaluations demonstrate that the HAG-ND framework significantly improves the accuracy and reliability of guidewire positioning and targeted delivery compared to existing methods. By harnessing the complementary capabilities of MLLMs and novel reinforcement learning techniques in a hierarchical architecture, HAG-ND represents a significant step towards fully autonomous and adaptive robotic-assisted endovascular interventions.
Title: A hierarchical information policy fusion framework with multimodal large language models for autonomous guidewire navigation in endovascular procedures. Journal: Information Fusion, vol. 130, Article 104115.
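The parliamentary mechanism is described as several specialized models voting on the course of action. A plain majority vote is one simple way such a mechanism could work. The sketch below assumes exactly that; the model names, the action labels, and the tie-breaking rule are hypothetical, since the abstract does not specify them.

```python
from collections import Counter

def parliamentary_vote(votes):
    """Return the action proposed by the most models.
    Ties are broken in favour of the action that was proposed first."""
    counts = Counter(votes.values())
    best = max(counts.values())
    for action in votes.values():
        if counts[action] == best:
            return action

# Hypothetical specialist models and their proposed actions.
votes = {
    "vessel_geometry": "advance",
    "tip_position": "advance",
    "release_timing": "hold",
}
decision = parliamentary_vote(votes)
```

A weighted vote, in which each specialist's ballot is scaled by its confidence, would be a natural variation on the same structure.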
Pub Date: 2026-01-03 | DOI: 10.1016/j.inffus.2026.104122
Zou Li, Jinzhi Liao, Jiting Li, Ji Wang, Xiang Zhao
The automatic detection of harmful memes is essential for healthy online ecosystems but remains challenging due to the intricate interaction between visual and textual elements. Recently, the capabilities of multimodal large language models (MLLMs) have significantly improved detection performance, yet scarce labeled data still limits their effectiveness. Although pioneering few-shot studies have explored this regime, they leverage only surface-level capabilities and ignore deeper complexities. To get at the core of the problem, we identify three main challenges: (1) heterogeneous multimodal features are complex and may be negatively correlated; (2) the semantic patterns underlying a single modality are hard to uncover; and (3) insufficient training samples make models more reliant on commonsense knowledge. To address these challenges, we propose a structural self-adaption mixture-of-experts framework (SSMoE) for few-shot harmful meme detection, which uses universal and specialized experts to foster more effective knowledge sharing, modal synergy, and expert specialization within the MLLM structure. Specifically, SSMoE integrates four novel components: (1) a Semantic Data Clustering module that partitions heterogeneous source data and mitigates negative transfer; (2) a Targeted Prompt Injection module that employs a teacher model to provide cluster-specific external guidance; (3) an Asymmetric Expert Specialization module that introduces shared and specialized experts for efficient parameter adaptation and knowledge specialization; and (4) a Cluster-conditioned Routing module that dynamically directs inputs to the most relevant expert pathway based on semantic cluster identity. Extensive experiments on three benchmark datasets (FHM, MAMI, HarM) demonstrate that SSMoE significantly outperforms state-of-the-art baseline methods, particularly in extremely low-data scenarios.
Title: Few-shot harmful meme detection via self-adaption mixture-of-experts. Authors: Zou Li , Jinzhi Liao , Jiting Li , Ji Wang , Xiang Zhao. Journal: Information Fusion, Volume 130, Article 104122 (impact factor 15.5). Publication date: 2026-01-03. DOI: 10.1016/j.inffus.2026.104122
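The cluster-conditioned routing idea named in the SSMoE abstract above can be illustrated with a small sketch. The abstract does not say how cluster identity is computed or how expert outputs are combined, so everything here is an assumption made for illustration: nearest-centroid assignment, one specialized expert per cluster, and element-wise summation with a shared expert. The names `assign_cluster` and `route` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Expert = Callable[[Vector], List[float]]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_cluster(embedding: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the nearest centroid (assumed clustering rule)."""
    return min(
        range(len(centroids)),
        key=lambda k: squared_distance(embedding, centroids[k]),
    )


def route(
    embedding: Vector,
    centroids: Sequence[Vector],
    shared_expert: Expert,
    specialized_experts: Sequence[Expert],
) -> List[float]:
    """Send the input through the shared expert and through the single
    specialized expert selected by its cluster id, then sum the outputs
    (the summation is an assumption, not the paper's stated rule)."""
    cluster_id = assign_cluster(embedding, centroids)
    shared_out = shared_expert(embedding)
    special_out = specialized_experts[cluster_id](embedding)
    return [s + p for s, p in zip(shared_out, special_out)]


# Toy setup: two clusters, an identity shared expert, two simple specialists.
centroids = [[0.0, 0.0], [10.0, 10.0]]
shared = lambda v: [float(x) for x in v]
specialists = [
    lambda v: [2.0 * x for x in v],   # expert for cluster 0
    lambda v: [-1.0 * x for x in v],  # expert for cluster 1
]

near_origin = route([1.0, 2.0], centroids, shared, specialists)
near_far = route([9.0, 8.0], centroids, shared, specialists)
```

Routing by a fixed cluster id, rather than by a learned gate, keeps the expert choice deterministic for a given input, which is one plausible reading of "based on semantic cluster identity".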
Pub Date : 2026-01-03 DOI: 10.1016/j.inffus.2025.104119
Farhana Yasmin , Yu Xue , Mahade Hasan , Ghulam Muhammad
Accurate brain tumor segmentation from magnetic resonance imaging (MRI) remains a significant challenge due to early loss of spatial detail, inadequate contextual representation, and ineffective decoder fusion. In this paper, we propose EPSO-Net, a multi-objective evolutionary neural architecture search (NAS) framework that integrates three specialized modules: UTSA for preserving spatial encoding and enhancing low-level feature representation, Astra for capturing semantic abstraction and multi-scale context, and Revo for improving decoder refinement through attention-guided fusion of feature maps. These modules work synergistically within a flexible modular 3D search space, enabling dynamic architecture optimization during the evolutionary process. EPSO-Net utilizes a particle swarm optimization (PSO)-guided mutation fusion mechanism that enables efficient exploration of the search space, adjusting mutation behavior based on performance feedback. To the best of our knowledge, this is the first multi-objective evolutionary NAS framework employing PSO-guided mutation fusion to adapt mutation strategies, driving the search towards optimal solutions in a resource-efficient manner. Experiments on the BraTS 2021, BraTS 2020, and MSD Brain Tumor datasets demonstrate that EPSO-Net outperforms nine state-of-the-art methods, achieving high Dice similarity coefficients (DSC) of 93.89%, 95.02%, and 91.25%, low 95th-percentile Hausdorff distances (HD95) of 1.14 mm, 1.02 mm, and 1.44 mm, and strong Grad-CAM IoU (GIoU) of 89.32%, 90.12%, and 85.68%, respectively. EPSO-Net also demonstrates reliable generalization to the CHAOS, PROMISE12, and ACDC datasets. Furthermore, it significantly reduces model complexity, lowers FLOPs, accelerates inference, and enhances interpretability. The full code will be publicly available at: https://github.com/Farhana005/EPSO-Net.
Title: EPSO-net: A multi-objective evolutionary neural architecture search with PSO-guided mutation fusion for explainable brain tumor segmentation. Journal: Information Fusion, Volume 130, Article 104119 (impact factor 15.5).
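For readers unfamiliar with the Dice similarity coefficient (DSC) reported in the EPSO-Net abstract above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. It only illustrates the metric; the paper's exact evaluation protocol (per tumor sub-region, per volume, averaging scheme) is not stated in the abstract, and treating two empty masks as a perfect match is an assumption made here.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks.

    Any non-zero entry counts as foreground. Two empty masks are
    scored as 1.0, which is a convention assumed for this sketch.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One overlapping voxel out of two predicted and two true foreground voxels.
half_overlap = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])

# Three predicted voxels, two true voxels, two of them shared.
partial_overlap = dice_coefficient([1, 1, 1, 0], [1, 1, 0, 0])
```

A reported DSC of 93.89% corresponds to a value of 0.9389 on this 0-to-1 scale.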
Pub Date : 2026-01-03 DOI: 10.1016/j.inffus.2025.104117
Xingyue Wang , Weipeng Hu , Jiun Tian Hoe , Jianhui Li , Ping Hu , Yap-Peng Tan
Transforming video perspectives from exocentric (third-person) to egocentric (first-person) is challenging due to the limited overlap between the two perspectives. Existing approaches often neglect temporal dynamics, which are critical for capturing motion cues and reappearing objects, and do not fully exploit semantics inferred from the source view. To address these limitations, we propose a Progressive Temporal Compensation and Semantic Enhancement (PCSE) framework for Exocentric-to-Egocentric Video Generation. The Progressive Temporal Compensation (PTC) module focuses on long-term temporal dependencies, progressively aligning exocentric temporal patterns with egocentric representations. By employing a reliance-shifting mechanism with a progression mask, PTC gradually reduces dependence on egocentric supervision, enabling more robust target-view learning. Moreover, to leverage high-level scene context, we introduce a Hierarchical Dual-channel Transformer (HDT), which jointly generates egocentric frames and their corresponding semantic layouts via dual encoder-decoder architectures with hierarchically processed transformer blocks. To further enhance structural coherence and semantic consistency, the generated semantic layouts guide frame refinement through an Uncertainty-aware Semantic Enhancement (USE) module. USE dynamically estimates uncertainty masks to locate and refine ambiguous regions, yielding more coherent and visually accurate results. Extensive experiments demonstrate that PCSE achieves leading performance among cue-free methods.
Title: Progressive temporal compensation and semantic enhancement for Exo-to-Ego video generation. Journal: Information Fusion, Volume 130, Article 104117 (impact factor 15.5).