Emotion Recognition Using Multimodal Physiological Signals Through Regional to Global Fusion with a Spatial-Temporal Semantic Alignment Mechanism
Pub Date: 2026-02-12 | DOI: 10.1016/j.inffus.2026.104224
Jian Shen, Huajian Liang, Ruirui Ma, Yihong Xu, Kexin Zhu, Haoran Gao, Kechen Hou, Yanan Zhang, Xiaowei Zhang, Bin Hu
With the continuous development of multimodal learning, emotion recognition using multimodal physiological signals has become a research hotspot. Studies have shown that combining electroencephalogram (EEG) signals and eye movements can significantly improve emotion recognition performance. However, current research still faces the following challenges: (1) Individuals’ response times and durations to different emotions vary, leading to data diversity and variability. (2) Different modalities exhibit spatiotemporal discrepancies, which may result in varying semantic relevance and significance under the same spatiotemporal conditions. To address these challenges, we propose a Regional to Global Fusion Network with a Spatial-Temporal Semantic Alignment Mechanism (R2GFANet). R2GFANet addresses the first challenge by employing padding masks in conjunction with a 1D-CNN to encode temporal semantic information from variable-length EEG signals and eye movements. It then leverages a Multi-Region Cross-Modal Attention mechanism for parallel temporal semantic alignment within each brain region and applies region-level spatial attention to highlight the semantic information of critical brain regions, effectively addressing spatiotemporal discrepancies across modalities. Comparisons with numerous state-of-the-art approaches on two public datasets, SEED-IV and SEED-V, demonstrate the outstanding performance and statistical significance of the proposed R2GFANet. Additional ablation studies and visualization analyses indicate that aligning EEG signals with eye movements not only improves classification performance but also provides neuroscientific interpretability.
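As an illustration of the masked temporal-encoding step described in the abstract, the sketch below pads variable-length trials, builds a boolean padding mask, and pools 1D-CNN features only over valid time steps. This is not the authors' R2GFANet code; the layer sizes, the 62-channel input, and the masked mean-pooling choice are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' implementation) of encoding variable-length
# signals with a padding mask plus a 1D-CNN, then pooling only over valid steps.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

class MaskedTemporalEncoder(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        # 1D convolution over time; padding=1 keeps the temporal length unchanged
        self.conv = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); mask: (batch, time), True at valid steps
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        h = h * mask.unsqueeze(-1)                        # zero out padded steps
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)
        return h.sum(dim=1) / lengths                     # masked mean over time

# Toy usage: two trials of different length with 62 feature channels (assumed).
trials = [torch.randn(120, 62), torch.randn(80, 62)]
x = pad_sequence(trials, batch_first=True)                # (2, 120, 62)
mask = torch.zeros(x.shape[:2], dtype=torch.bool)
for i, t in enumerate(trials):
    mask[i, : t.shape[0]] = True
encoder = MaskedTemporalEncoder(in_channels=62)
print(encoder(x, mask).shape)                             # torch.Size([2, 64])
```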
{"title":"Emotion Recognition Using Multimodal Physiological Signals Through Regional to Global Fusion with a Spatial-Temporal Semantic Alignment Mechanism","authors":"Jian Shen, Huajian Liang, Ruirui Ma, Yihong Xu, Kexin Zhu, Haoran Gao, Kechen Hou, Yanan Zhang, Xiaowei Zhang, Bin Hu","doi":"10.1016/j.inffus.2026.104224","DOIUrl":"https://doi.org/10.1016/j.inffus.2026.104224","url":null,"abstract":"With the continuous development of multimodal learning, emotion recognition using multimodal physiological signals has become a research hotspot. Studies have shown that combining electroencephalogram (EEG) signals and eye movements can significantly improve the results of emotion recognition. However, current research still faces the following challenges: (1) Individuals’ response times and durations to different emotions vary, leading to data diversity and variability. (2) Different modalities exhibit spatiotemporal discrepancies, which may result in varying semantic relevance and significance under the same spatiotemporal conditions. To address these challenges, we propose a Regional to Global Fusion Network with a Spatial-Temporal Semantic Alignment Mechanism (R2GFANet). Initially, R2GFANet addresses the first challenge by employing padding masks in conjunction with a 1D-CNN network to encode temporal semantic information from variable-length EEG signals and eye movements. Subsequently, R2GFANet leverages a Multi-Region Cross-Modal Attention mechanism for parallel temporal semantic alignment within each brain region and applies region-level spatial attention to highlight the semantic information of critical brain regions, effectively addressing spatiotemporal discrepancies across modalities. By comparing our method with numerous state-of-the-art approaches on two public datasets, SEED-IV and SEED-V, we demonstrate the outstanding performance and statistical significance of the proposed R2GFANet. Additionally, we conduct ablation studies and visualization analyses. The results indicate that aligning EEG signals with eye movements not only improves classification performance but also provides neuroscientific interpretability.","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"36 1","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146208863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PIA: Fusing Edge Prior Information into Attention for Semantic Segmentation in Vision Transformer
Pub Date: 2026-02-11 | DOI: 10.1016/j.inffus.2026.104222
Ruijie Xiao, Bo Yang, Qianyang Zhu
{"title":"PIA: Fusing Edge Prior Information into Attention for Semantic Segmentation in Vision Transformer","authors":"Ruijie Xiao, Bo Yang, Qianyang Zhu","doi":"10.1016/j.inffus.2026.104222","DOIUrl":"https://doi.org/10.1016/j.inffus.2026.104222","url":null,"abstract":"","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"30 1","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146160884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging cognition and emotion: Empathy-driven multimodal misinformation detection
Pub Date: 2026-02-09 | DOI: 10.1016/j.inffus.2026.104210
Lu Yuan, Zihan Wang, Zhengxuan Zhang, Lei Shi
In the digital era, social media accelerates the spread of misinformation. Existing detection methods often rely on shallow linguistic or propagation features and lack principled multimodal fusion, failing to capture creators’ emotional manipulation and readers’ psychological responses, which limits prediction accuracy. We propose the Dual-Aspect Empathy Framework (DAE), which derives creator and reader perspectives by fusing separately modeled cognitive and emotional empathy. Creators’ cognitive strategies and affective appeals are analyzed, while Large Language Models (LLMs) simulate readers’ judgments and emotional reactions, providing richer and more human-like signals than conventional classifiers, and partially alleviating the analytical challenge posed by insufficient human feedback. An empathy-aware filtering mechanism is further designed to refine outputs, enhancing authenticity and diversity. The pipeline integrates multimodal feature extraction, empathy-oriented representation learning, LLM-based reader simulation, and empathy-aware filtering. Experiments on benchmark datasets such as PolitiFact, GossipCop and Pheme show that the fusion-based DAE consistently outperforms state-of-the-art baselines, offering a novel and human-centric paradigm for misinformation detection.
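The snippet below is a hypothetical, simplified illustration of the dual-aspect idea: creator-side features and (LLM-simulated) reader-side features are fused for a real-vs-misinformation decision, with a confidence threshold standing in for the empathy-aware filtering step. It is not the authors' DAE implementation; all dimensions, names, and the thresholding rule are assumptions.

```python
# Hypothetical sketch of fusing creator-side and simulated reader-side features,
# with a simple confidence filter on the reader signals (an assumption, not DAE).
import torch
import torch.nn as nn

class DualAspectFusion(nn.Module):
    def __init__(self, creator_dim: int, reader_dim: int, hidden: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(creator_dim + reader_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # real vs. misinformation
        )

    def forward(self, creator_feat, reader_feat, reader_conf, conf_threshold=0.5):
        # Suppress simulated reader signals whose confidence is below threshold.
        keep = (reader_conf >= conf_threshold).float().unsqueeze(-1)
        fused = torch.cat([creator_feat, reader_feat * keep], dim=-1)
        return self.fuse(fused)

# Toy usage with random tensors standing in for encoder / LLM outputs.
creator = torch.randn(4, 256)  # e.g. multimodal features of the post creator
reader = torch.randn(4, 64)    # e.g. embedded LLM-simulated reader reactions
conf = torch.rand(4)           # per-sample confidence of the simulated reader
logits = DualAspectFusion(256, 64)(creator, reader, conf)
print(logits.shape)            # torch.Size([4, 2])
```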
{"title":"Bridging cognition and emotion: Empathy-driven multimodal misinformation detection","authors":"Lu Yuan , Zihan Wang , Zhengxuan Zhang , Lei Shi","doi":"10.1016/j.inffus.2026.104210","DOIUrl":"10.1016/j.inffus.2026.104210","url":null,"abstract":"<div><div>In the digital era, social media accelerates the spread of misinformation. Existing detection methods often rely on shallow linguistic or propagation features and lack principled multimodal fusion, failing to capture creators’ emotional manipulation and readers’ psychological responses, which limits prediction accuracy. We propose the Dual-Aspect Empathy Framework (DAE), which derives creator and reader perspectives by fusing separately modeled cognitive and emotional empathy. Creators’ cognitive strategies and affective appeals are analyzed, while Large Language Models (LLMs) simulate readers’ judgments and emotional reactions, providing richer and more human-like signals than conventional classifiers, and partially alleviating the analytical challenge posed by insufficient human feedback. An empathy-aware filtering mechanism is further designed to refine outputs, enhancing authenticity and diversity. The pipeline integrates multimodal feature extraction, empathy-oriented representation learning, LLM-based reader simulation, and empathy-aware filtering. Experiments on benchmark datasets such as PolitiFact, GossipCop and Pheme show that the fusion-based DAE consistently outperforms state-of-the-art baselines, offering a novel and human-centric paradigm for misinformation detection.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104210"},"PeriodicalIF":15.5,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146146572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainable visual question answering: A survey on methods, datasets and evaluation
Pub Date: 2026-02-08 | DOI: 10.1016/j.inffus.2026.104215
Yaxian Wang, Qikan Lin, Jiangbo Shi, Yisheng An, Jun Liu, Bifan Wei, Xudong Jiang
In recent years, visual question answering (VQA) has become a significant task at the intersection of computer vision and natural language processing, requiring models to jointly understand images and textual queries. It has emerged as a popular benchmark for evaluating multimodal understanding and reasoning. As VQA accuracy has advanced, there is a growing demand for explainability and transparency in VQA models, which is crucial for improving their trustworthiness and applicability in critical domains. This survey explores the emerging field of eXplainable Visual Question Answering (XVQA), which aims not only to provide the correct answer but also to generate meaningful explanations that justify the predicted answers. First, we systematically review existing XVQA methods and propose a three-level taxonomy to organize them; the taxonomy primarily categorizes XVQA methods by the timing of rationale generation and the form of the rationales. Second, we review existing VQA datasets annotated with explanations in different forms, including textual, visual and multimodal rationales. Furthermore, we summarize the evaluation metrics of XVQA for different forms of rationales. Finally, we outline the challenges for XVQA and discuss potential future directions. We aim to organize existing research in this domain and inspire future investigations into the explainability of VQA models.
{"title":"Explainable visual question answering: A survey on methods, datasets and evaluation","authors":"Yaxian Wang , Qikan Lin , Jiangbo Shi , Yisheng An , Jun Liu , Bifan Wei , Xudong Jiang","doi":"10.1016/j.inffus.2026.104215","DOIUrl":"10.1016/j.inffus.2026.104215","url":null,"abstract":"<div><div>In recent years, visual question answering has become a significant task at the intersection of computer vision and natural language processing, requiring models to jointly understand images and textual queries. It has emerged as a popular benchmark for evaluating multimodal understanding and reasoning. With advancements in VQA accuracy, there is a growing demand for explainability and transparency for VQA models, which is crucial for improving their trust and applicability in critical domains. This survey explores the emerging field of e<strong>X</strong>plainable <strong>V</strong>isual <strong>Q</strong>uestion <strong>A</strong>nswering (XVQA), which aims not only to provide the correct answer but also to generate meaningful explanations that justify the predicted answers. Firstly, we systematically review existing methods on XVQA, and propose a three-level taxonomy to organize them. The proposed taxonomy primarily categorizes XVQA methods based on the timing of the rationale generation and the forms of the rationales. Secondly, we review the existing VQA datasets annotated with explanations in different forms, including textual, visual and multimodal rationales. Furthermore, we summarize the evaluation metrics of XVQA for different forms of rationales. Finally, we outline the challenges for XVQA and discuss potential future directions. We aim to organize existing research in this domain and inspire future investigations into the explainability of VQA models.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104215"},"PeriodicalIF":15.5,"publicationDate":"2026-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MuDeNet: A multi-patch descriptor network for anomaly modeling
Pub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104214
Miguel Campos-Romero, Manuel Carranza-García, Robert-Jan Sips, José C. Riquelme
Visual anomaly detection is a crucial task in industrial manufacturing, enabling early defect identification and minimizing production bottlenecks. Existing methods often struggle to effectively detect both structural anomalies, which appear as unexpected local patterns, and logical anomalies, which arise from violations of global contextual constraints. To address this challenge, we propose MuDeNet, an unsupervised Multi-patch Descriptor Network that performs multi-scale fusion of local structural features and global contextual information for comprehensive anomaly modeling. MuDeNet employs a lightweight teacher-student framework that jointly extracts and fuses local and global patch descriptors across multiple receptive fields within a single forward pass. Knowledge is first distilled from a pre-trained CNN to efficiently obtain semantic representations, which are then processed by two complementary modules: the structural module, targeting fine-grained defects at small receptive fields, and the logical module, modeling long-range contextual dependencies. Their outputs are fused at the decision level, yielding a unified anomaly score that integrates local and global evidence. Extensive experiments on three state-of-the-art datasets position MuDeNet as an efficient and scalable solution for real-time industrial anomaly detection and segmentation, consistently outperforming existing approaches.
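To illustrate the decision-level fusion described above, the sketch below combines a structural (local) and a logical (global-context) anomaly map into a single score after per-branch normalization. The normalization statistics and equal weighting are assumptions for illustration, not MuDeNet's actual parameters.

```python
# Minimal sketch of decision-level fusion of two anomaly maps; the statistics
# and weighting below are illustrative assumptions, not MuDeNet's parameters.
import numpy as np

def fuse_anomaly_maps(structural: np.ndarray, logical: np.ndarray,
                      stats: dict, weight: float = 0.5) -> np.ndarray:
    # Normalize each branch with statistics estimated on anomaly-free validation
    # data so the two score ranges become comparable before fusion.
    s = (structural - stats["struct_mean"]) / (stats["struct_std"] + 1e-8)
    l = (logical - stats["logic_mean"]) / (stats["logic_std"] + 1e-8)
    return weight * s + (1.0 - weight) * l  # per-pixel fused anomaly score

# Toy usage: 64x64 score maps from two hypothetical branches.
rng = np.random.default_rng(0)
struct_map, logic_map = rng.random((64, 64)), rng.random((64, 64))
stats = {"struct_mean": 0.5, "struct_std": 0.29,
         "logic_mean": 0.5, "logic_std": 0.29}
fused = fuse_anomaly_maps(struct_map, logic_map, stats)
image_score = fused.max()  # image-level score = most anomalous pixel
```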
{"title":"MuDeNet: A multi-patch descriptor network for anomaly modeling","authors":"Miguel Campos-Romero , Manuel Carranza-García , Robert-Jan Sips , José C. Riquelme","doi":"10.1016/j.inffus.2026.104214","DOIUrl":"10.1016/j.inffus.2026.104214","url":null,"abstract":"<div><div>Visual anomaly detection is a crucial task in industrial manufacturing, enabling early defect identification and minimizing production bottlenecks. Existing methods often struggle to effectively detect both structural anomalies, which appear as unexpected local patterns, and logical anomalies, which arise from violations of global contextual constraints. To address this challenge, we propose MuDeNet, an unsupervised Multi-patch Descriptor Network that performs multi-scale fusion of local structural features and global contextual information for comprehensive anomaly modeling. MuDeNet employs a lightweight teacher-student framework that jointly extracts and fuses local and global patch descriptors across multiple receptive fields within a single forward pass. Knowledge is first distilled from a pre-trained CNN to efficiently obtain semantic representations, which are then processed by two complementary modules: the structural module, targeting fine-grained defects at small receptive fields, and the logical module, modeling long-range contextual dependencies. Their outputs are fused at the decision level, yielding a unified anomaly score that integrates local and global evidence. Extensive experiments on three state-of-the-art datasets position MuDeNet as an efficient and scalable solution for real-time industrial anomaly detection and segmentation, consistently outperforming existing approaches.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104214"},"PeriodicalIF":15.5,"publicationDate":"2026-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedFusionNet: Advancing oral cancer recurrence prediction through federated fusion modeling
Pub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104205
Al Rafi Aurnob, Sharia Arfin Tanim, Tahmid Enam Shrestha, M.F. Mridha, Durjoy Mistry
Oral cancer is a considerable global medical problem that demands new technologies for reliable early diagnosis and advanced care. This study introduces FedFusionNet, a fusion-centric model developed to advance early oral cancer diagnosis while preserving data privacy. The primary objective is to train the model with federated learning (FL) across diverse healthcare facilities worldwide without compromising patient data confidentiality. The model fuses features from the ResNeXt101 32X8D and InceptionV3 backbones through single-level fusion via feature concatenation, which enhances its effectiveness and stability. Specifically, the federated averaging (FedAvg) technique enables collaborative model training across multiple hospitals while safeguarding sensitive patient information, so that each participating hospital can contribute to model development without sharing raw data. The proposed model was trained on a dataset of 10,002 images comprising both healthy and cancerous oral tissues. Rigorous training and evaluation were conducted under both Independent and Identically Distributed (IID) and Non-Identically Distributed (Non-IID) settings, where FedFusionNet demonstrated superior performance for oral cancer diagnosis compared with pre-trained and custom baseline models. This scalable and secure framework has profound implications for healthcare analytics. The present work is a proof-of-concept that uses publicly available data to establish the technical feasibility of the FedFusionNet framework; future deployment in real collaborative environments would demonstrate its security-by-design capabilities across hospitals, where patient data confidentiality is a priority.
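The federated averaging (FedAvg) step referenced above can be illustrated as follows: each hospital shares only its model parameters, and the server averages them weighted by local dataset size. This is the generic FedAvg update rather than FedFusionNet-specific code; the model, client count, and sample sizes are illustrative.

```python
# Generic FedAvg aggregation step: average client parameter tensors weighted by
# local sample counts; no raw images leave the clients. Values are illustrative.
import torch
import torch.nn as nn

def fedavg(client_states: list, client_sizes: list) -> dict:
    total = float(sum(client_sizes))
    keys = client_states[0].keys()
    # Weighted average of each parameter tensor across clients.
    return {
        k: sum(state[k] * (n / total)
               for state, n in zip(client_states, client_sizes))
        for k in keys
    }

# Toy usage: three hospitals with differently sized local datasets (assumed).
clients = [nn.Linear(10, 2) for _ in range(3)]
global_state = fedavg([c.state_dict() for c in clients],
                      client_sizes=[4000, 2500, 3502])
server_model = nn.Linear(10, 2)
server_model.load_state_dict(global_state)  # broadcast back to clients next round
```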
{"title":"FedFusionNet: Advancing oral cancer recurrence prediction through federated fusion modeling","authors":"Al Rafi Aurnob , Sharia Arfin Tanim , Tahmid Enam Shrestha , M.F. Mridha , Durjoy Mistry","doi":"10.1016/j.inffus.2026.104205","DOIUrl":"10.1016/j.inffus.2026.104205","url":null,"abstract":"<div><div>Oral cancer represents a considerable global medical problem that requires the development of new technologies that offer reliable advanced therapies. This study introduced FedFusionNet, a fusion-centric model that was meticulously developed to advance early oral cancer diagnosis while preserving data privacy. The primary objective was to develop a model using federated learning (FL) to train across diverse healthcare facilities globally without compromising patient data confidentiality. This model uses features from the ResNeXt101 32X8D and InceptionV3 models to implement a single-level fusion via feature concatenation. This helps to enhance the effectiveness and stability of the model. Specifically, the federated averaging (FedAvg) technique fosters collaborative model training across multiple hospitals while safeguarding sensitive patient information. This ensured that each participating hospital could contribute to the development of the model without sharing the raw data. The proposed model was trained on a dataset of 10,002 images that included both healthy and cancerous oral tissues. Rigorous training and evaluation were conducted for both Independent and Identically Distributed (IID) and Independent and Non-Identically Distributed (Non-IID) settings. FedFusionNet demonstrated superior performance compared with pre-trained and some custom models for oral cancer diagnosis. This scalable and secure framework has profound implications for healthcare analytics. It is a proof-of-concept demonstration that utilizes publicly available data to establish the technical feasibility of the FedFusionNet framework. Future deployment in actual collaborative environments would demonstrate its security-by-design capabilities across hospitals, where patient data confidentiality is a priority.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104205"},"PeriodicalIF":15.5,"publicationDate":"2026-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}