Pub Date: 2026-08-01 | Epub Date: 2026-01-01 | DOI: 10.1016/j.inffus.2025.104120
Qurat Ul Ain , Fatima Khalid , Hafsa Ilyas , Ali Javed , Khalid Mahmood Malik , Khan Muhammad , Aun Irtaza
As the technology behind deepfakes advances, detecting audio-visual deepfakes becomes increasingly crucial, and the rise of traditional adversarial attacks and generative AI-based anti-forensics attacks on deepfake detection technologies is a growing concern. Securing applications against adversarial and generative AI-based attacks is critical for accurate and robust deepfake detection tools. Therefore, this paper provides a comprehensive overview of adversarial and generative AI-based anti-forensic attacks, robustness to which is one of the core elements of trustworthiness alongside transparency, explainability, and fairness, as well as of defensive countermeasures for audio-visual deepfake generation and detection. It covers adversarial attacks on deepfake detection algorithms and defensive methods, including model fusion and decoy-based approaches, to mitigate these threats. Although extensive research has been conducted in recent years on adversarial attacks and defenses for deepfake detection, there have been few attempts to compare existing work qualitatively and quantitatively. This paper aims to help identify and address the key issues that must be considered to advance transferable adversarial attacks and their countermeasures, particularly through techniques such as generative defense, knowledge distillation, and beyond.
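To make the threat model concrete, below is a minimal sketch of one classical evasion attack within the review's scope, the fast gradient sign method (FGSM); the `detector`, input range, and budget `eps` are illustrative assumptions, not a method taken from the paper.

```python
import torch

def fgsm_evasion(detector, x, y_fake, eps=4 / 255):
    """One-step gradient-sign perturbation (FGSM).

    `detector` is any differentiable deepfake classifier returning logits,
    `x` a batch of (audio-)visual inputs in [0, 1], and `y_fake` the true
    'fake' labels the attacker wants the detector to miss.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(detector(x_adv), y_fake)
    loss.backward()
    # Step *up* the loss so the 'fake' class becomes less likely.
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```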
{"title":"Adversarial and generative AI-based anti-forensics in audio-visual deepfake detection: A comprehensive review and analysis","authors":"Qurat Ul Ain , Fatima Khalid , Hafsa Ilyas , Ali Javed , Khalid Mahmood Malik , Khan Muhammad , Aun Irtaza","doi":"10.1016/j.inffus.2025.104120","DOIUrl":"10.1016/j.inffus.2025.104120","url":null,"abstract":"<div><div>As the technology behind deepfakes advances, detecting audio-visual deepfakes becomes more and more crucial, and the rise of traditional and generative AI-based adversarial/anti-forensics attacks and generative AI-based anti-forensics attacks on deepfake detection technologies is a growing concern. Securing applications against adversarial and generative AI-based attacks is critical for accurate and robust deepfake detection tools. Therefore, this paper provides a comprehensive overview of various adversarial and generative AI-based anti-forensic attacks, which represent one of the core elements of trustworthiness alongside transparency, explainability, and fairness, as well as defensive countermeasures for audio-visual deepfake generation and detection. It covers topics such as adversarial attacks on deepfake detection algorithms and defensive methods, including model fusion and decoy-based approaches, to mitigate these threats. Although extensive research has been conducted in recent years on adversarial attacks and defense on deepfake detection, there have been few attempts to compare existing work qualitatively and quantitatively. This paper aims to help identify and address key issues that need to be considered to bring transferable adversarial attacks and their countermeasures particularly through techniques such as generative defense, knowledge distillation, and beyond.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104120"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146192948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-09 | DOI: 10.1016/j.inffus.2026.104210
Lu Yuan , Zihan Wang , Zhengxuan Zhang , Lei Shi
In the digital era, social media accelerates the spread of misinformation. Existing detection methods often rely on shallow linguistic or propagation features and lack principled multimodal fusion, failing to capture creators’ emotional manipulation and readers’ psychological responses, which limits prediction accuracy. We propose the Dual-Aspect Empathy Framework (DAE), which derives creator and reader perspectives by fusing separately modeled cognitive and emotional empathy. Creators’ cognitive strategies and affective appeals are analyzed, while Large Language Models (LLMs) simulate readers’ judgments and emotional reactions, providing richer and more human-like signals than conventional classifiers, and partially alleviating the analytical challenge posed by insufficient human feedback. An empathy-aware filtering mechanism is further designed to refine outputs, enhancing authenticity and diversity. The pipeline integrates multimodal feature extraction, empathy-oriented representation learning, LLM-based reader simulation, and empathy-aware filtering. Experiments on benchmark datasets such as PolitiFact, GossipCop and Pheme show that the fusion-based DAE consistently outperforms state-of-the-art baselines, offering a novel and human-centric paradigm for misinformation detection.
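As a rough illustration of fusing the two perspectives, the sketch below concatenates creator-side and LLM-simulated reader-side feature vectors in a small classification head; all dimensions, names, and layer choices are assumptions, since the paper's DAE pipeline is considerably richer.

```python
import torch
import torch.nn as nn

class EmpathyFusionHead(nn.Module):
    """Toy late-fusion head: concatenate creator-side (cognitive/affective)
    features with LLM-simulated reader-side features and classify the post.
    Dimensions and layer sizes are illustrative, not the paper's."""

    def __init__(self, d_creator=256, d_reader=256, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_creator + d_reader, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, creator_feat, reader_feat):
        fused = torch.cat([creator_feat, reader_feat], dim=-1)
        return self.classifier(fused)
```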
{"title":"Bridging cognition and emotion: Empathy-driven multimodal misinformation detection","authors":"Lu Yuan , Zihan Wang , Zhengxuan Zhang , Lei Shi","doi":"10.1016/j.inffus.2026.104210","DOIUrl":"10.1016/j.inffus.2026.104210","url":null,"abstract":"<div><div>In the digital era, social media accelerates the spread of misinformation. Existing detection methods often rely on shallow linguistic or propagation features and lack principled multimodal fusion, failing to capture creators’ emotional manipulation and readers’ psychological responses, which limits prediction accuracy. We propose the Dual-Aspect Empathy Framework (DAE), which derives creator and reader perspectives by fusing separately modeled cognitive and emotional empathy. Creators’ cognitive strategies and affective appeals are analyzed, while Large Language Models (LLMs) simulate readers’ judgments and emotional reactions, providing richer and more human-like signals than conventional classifiers, and partially alleviating the analytical challenge posed by insufficient human feedback. An empathy-aware filtering mechanism is further designed to refine outputs, enhancing authenticity and diversity. The pipeline integrates multimodal feature extraction, empathy-oriented representation learning, LLM-based reader simulation, and empathy-aware filtering. Experiments on benchmark datasets such as PolitiFact, GossipCop and Pheme show that the fusion-based DAE consistently outperforms state-of-the-art baselines, offering a novel and human-centric paradigm for misinformation detection.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104210"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146146572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104214
Miguel Campos-Romero , Manuel Carranza-García , Robert-Jan Sips , José C. Riquelme
Visual anomaly detection is a crucial task in industrial manufacturing, enabling early defect identification and minimizing production bottlenecks. Existing methods often struggle to effectively detect both structural anomalies, which appear as unexpected local patterns, and logical anomalies, which arise from violations of global contextual constraints. To address this challenge, we propose MuDeNet, an unsupervised Multi-patch Descriptor Network that performs multi-scale fusion of local structural features and global contextual information for comprehensive anomaly modeling. MuDeNet employs a lightweight teacher-student framework that jointly extracts and fuses local and global patch descriptors across multiple receptive fields within a single forward pass. Knowledge is first distilled from a pre-trained CNN to efficiently obtain semantic representations, which are then processed by two complementary modules: the structural module, targeting fine-grained defects at small receptive fields, and the logical module, modeling long-range contextual dependencies. Their outputs are fused at the decision level, yielding a unified anomaly score that integrates local and global evidence. Extensive experiments on three state-of-the-art datasets position MuDeNet as an efficient and scalable solution for real-time industrial anomaly detection and segmentation, consistently outperforming existing approaches.
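A minimal sketch of the teacher-student idea mentioned above: the student regresses frozen pre-trained teacher descriptors during training, and the regression residual serves as the anomaly score at test time. This is generic distillation-based anomaly detection, not MuDeNet's exact objective or architecture.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    """Feature-regression distillation: the student mimics the frozen
    teacher's patch descriptors (both of shape (B, C, H, W))."""
    return F.mse_loss(student_feats, teacher_feats.detach())

def anomaly_map(student_feats, teacher_feats):
    # Per-location squared residual, averaged over channels -> (B, H, W) map;
    # large residuals flag structural or logical anomalies.
    return ((student_feats - teacher_feats) ** 2).mean(dim=1)
```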
{"title":"MuDeNet: A multi-patch descriptor network for anomaly modeling","authors":"Miguel Campos-Romero , Manuel Carranza-García , Robert-Jan Sips , José C. Riquelme","doi":"10.1016/j.inffus.2026.104214","DOIUrl":"10.1016/j.inffus.2026.104214","url":null,"abstract":"<div><div>Visual anomaly detection is a crucial task in industrial manufacturing, enabling early defect identification and minimizing production bottlenecks. Existing methods often struggle to effectively detect both structural anomalies, which appear as unexpected local patterns, and logical anomalies, which arise from violations of global contextual constraints. To address this challenge, we propose MuDeNet, an unsupervised Multi-patch Descriptor Network that performs multi-scale fusion of local structural features and global contextual information for comprehensive anomaly modeling. MuDeNet employs a lightweight teacher-student framework that jointly extracts and fuses local and global patch descriptors across multiple receptive fields within a single forward pass. Knowledge is first distilled from a pre-trained CNN to efficiently obtain semantic representations, which are then processed by two complementary modules: the structural module, targeting fine-grained defects at small receptive fields, and the logical module, modeling long-range contextual dependencies. Their outputs are fused at the decision level, yielding a unified anomaly score that integrates local and global evidence. Extensive experiments on three state-of-the-art datasets position MuDeNet as an efficient and scalable solution for real-time industrial anomaly detection and segmentation, consistently outperforming existing approaches.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104214"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-08 | DOI: 10.1016/j.inffus.2026.104215
Yaxian Wang , Qikan Lin , Jiangbo Shi , Yisheng An , Jun Liu , Bifan Wei , Xudong Jiang
In recent years, visual question answering has become a significant task at the intersection of computer vision and natural language processing, requiring models to jointly understand images and textual queries. It has emerged as a popular benchmark for evaluating multimodal understanding and reasoning. With advancements in VQA accuracy, there is a growing demand for explainability and transparency in VQA models, which is crucial for improving trust in these models and their applicability in critical domains. This survey explores the emerging field of eXplainable Visual Question Answering (XVQA), which aims not only to provide the correct answer but also to generate meaningful explanations that justify the predicted answers. Firstly, we systematically review existing XVQA methods and propose a three-level taxonomy to organize them. The proposed taxonomy primarily categorizes XVQA methods based on the timing of the rationale generation and the forms of the rationales. Secondly, we review the existing VQA datasets annotated with explanations in different forms, including textual, visual and multimodal rationales. Furthermore, we summarize the evaluation metrics of XVQA for different forms of rationales. Finally, we outline the challenges for XVQA and discuss potential future directions. We aim to organize existing research in this domain and inspire future investigations into the explainability of VQA models.
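As a toy illustration of such a taxonomy, surveyed methods could be recorded along the two axes the abstract names; only the rationale forms (textual, visual, multimodal) come from the abstract, while the timing labels and method names below are assumed placeholders.

```python
from dataclasses import dataclass

@dataclass
class XVQAMethod:
    """Toy record for organizing surveyed XVQA methods along two axes."""
    name: str
    rationale_timing: str   # e.g. "intrinsic" vs. "post-hoc" (assumed labels)
    rationale_form: str     # "textual", "visual", or "multimodal" (from the abstract)

catalog = [
    XVQAMethod("ExampleMethodA", "post-hoc", "textual"),
    XVQAMethod("ExampleMethodB", "intrinsic", "multimodal"),
]

# Group methods by the form of rationale they produce.
by_form: dict[str, list[str]] = {}
for m in catalog:
    by_form.setdefault(m.rationale_form, []).append(m.name)
```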
{"title":"Explainable visual question answering: A survey on methods, datasets and evaluation","authors":"Yaxian Wang , Qikan Lin , Jiangbo Shi , Yisheng An , Jun Liu , Bifan Wei , Xudong Jiang","doi":"10.1016/j.inffus.2026.104215","DOIUrl":"10.1016/j.inffus.2026.104215","url":null,"abstract":"<div><div>In recent years, visual question answering has become a significant task at the intersection of computer vision and natural language processing, requiring models to jointly understand images and textual queries. It has emerged as a popular benchmark for evaluating multimodal understanding and reasoning. With advancements in VQA accuracy, there is a growing demand for explainability and transparency for VQA models, which is crucial for improving their trust and applicability in critical domains. This survey explores the emerging field of e<strong>X</strong>plainable <strong>V</strong>isual <strong>Q</strong>uestion <strong>A</strong>nswering (XVQA), which aims not only to provide the correct answer but also to generate meaningful explanations that justify the predicted answers. Firstly, we systematically review existing methods on XVQA, and propose a three-level taxonomy to organize them. The proposed taxonomy primarily categorizes XVQA methods based on the timing of the rationale generation and the forms of the rationales. Secondly, we review the existing VQA datasets annotated with explanations in different forms, including textual, visual and multimodal rationales. Furthermore, we summarize the evaluation metrics of XVQA for different forms of rationales. Finally, we outline the challenges for XVQA and discuss potential future directions. We aim to organize existing research in this domain and inspire future investigations into the explainability of VQA models.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104215"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-05 | DOI: 10.1016/j.inffus.2026.104206
Bojia Liu , Conghui Zheng , Li Pan
Heterogeneous graph representation learning seeks to capture the complex structural and semantic properties of heterogeneous graphs. The integration of hyperbolic space, which is well-suited to modeling the intrinsic degree power-law distribution of graphs, has facilitated significant advancements in this area. Recent methods leverage hyperbolic attention mechanisms to fuse semantic information within metapath-induced subgraphs. Despite this progress, a major limitation remains: these methods leverage attention for information aggregation but fail to model the causal relationship between semantic fusion and downstream task performance, leading to spurious semantic associations that reduce robustness to noise and impair cross-task generalization. To address this challenge, we propose a Causal ATtention enhanCed Hyperbolic Heterogeneous Graph Neural Network (CATCH), aiming to achieve sufficient semantic information fusion. To the best of our knowledge, CATCH is the first to integrate hyperbolic space with causal inference for heterogeneous graph representations, directly targeting spurious semantic correlations at the source. Specifically, CATCH explicitly encodes the Euclidean node attributes of different types into a shared semantic hyperbolic space. To capture the underlying semantics, context subgraphs based on first-order and high-order metapaths are constructed to facilitate hyperbolic attention-based intra-level and inter-level information aggregation, thus forming comprehensive representations. Finally, a causal attention enhancement mechanism is implemented with direct supervision on attention learning, leveraging counterfactual causal inference to generate counterfactual representations for computing direct causal effects. By jointly optimizing a task-specific objective alongside a causal loss, CATCH promotes more faithful semantic encoding, leading to improved robustness and generalization. Extensive experiments on four real-world datasets validate the superior performance of CATCH across multiple tasks. The implementation is available at https://github.com/Crystal-LiuBojia/CATCH.
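For readers unfamiliar with the hyperbolic encoding step, the sketch below shows the standard exponential map at the origin of a Poincaré ball, one common way to project Euclidean node attributes into hyperbolic space; the curvature value and tensor shapes are illustrative, and CATCH's full encoder is not reproduced here.

```python
import torch

def expmap0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of a Poincare ball with curvature -c:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Usage: node_attrs is an (N, d) tensor of Euclidean node features.
node_attrs = torch.randn(8, 16)
hyp_feats = expmap0(node_attrs)
assert (hyp_feats.norm(dim=-1) < 1).all()  # points land inside the unit ball (c=1)
```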
[Graphical abstract: Recommendation performance on Amazon-CD and Amazon-Book.]
{"title":"CATCH: Causal attention enhanced meta-path semantic fusion for robust hyperbolic heterogeneous graph embedding","authors":"Bojia Liu , Conghui Zheng , Li Pan","doi":"10.1016/j.inffus.2026.104206","DOIUrl":"10.1016/j.inffus.2026.104206","url":null,"abstract":"<div><div>Heterogeneous graph representation learning seeks to capture the complex structural and semantic properties in heterogeneous graphs. The integration of hyperbolic space, which is well-suited to modeling the intrinsic degree power-law distribution of graphs, has facilitated significant advancements in this area. Recent methods leverage hyperbolic attention mechanisms to fuse semantic information within metapath-induced subgraphs. Despite this progress, a major limitation remains: these methods leverage attention for information aggregation but fail to model the causal relationship between semantic fusion and downstream task performance, leading to spurious semantic associations that reduce robustness to noise and impair cross-task generalization. To address this challenge, we propose a <strong>C</strong>ausal <strong>AT</strong>tention enhan<strong>C</strong>ed <strong>H</strong>yperbolic Heterogeneous Graph Neural Network (<strong>CATCH</strong>), intending to achieve sufficient semantic information fusion. To the best of our knowledge, CATCH is the first to integrate hyperbolic space with causal inference for heterogeneous graph representations, directly targeting spurious semantic correlations at the source. Specifically, CATCH explicitly encodes the Euclidean node attributes of different types into a shared semantic hyperbolic space. To capture the underlying semantics, context subgraphs based on one-order and high-order metapaths are constructed to facilitate hyperbolic attention-based intra-level and inter-level information aggregation, thus forming comprehensive representations. Finally, a causal attention enhancement mechanism is implemented with direct supervision on attention learning, leveraging counterfactual causal inference to generate counterfactual representations for computing direct causal effects. By jointly optimizing a task-specific objective alongside a causal loss, CATCH promotes more faithful semantic encoding, leading to improved robustness and generalization. Extensive experiments on four real-world datasets validate the superior performance of CATCH across multiple tasks. The implementation is available at <span><span>https://github.com/Crystal-LiuBojia/CATCH</span><svg><path></path></svg></span>.</div><div>Recommendation performance on Amazon-CD and Amazon-Book.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104206"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146134527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104205
Al Rafi Aurnob , Sharia Arfin Tanim , Tahmid Enam Shrestha , M.F. Mridha , Durjoy Mistry
Oral cancer represents a considerable global medical problem that requires the development of new technologies offering reliable, advanced therapies. This study introduces FedFusionNet, a fusion-centric model meticulously developed to advance early oral cancer diagnosis while preserving data privacy. The primary objective is to develop a model using federated learning (FL) that can be trained across diverse healthcare facilities globally without compromising patient data confidentiality. The model fuses features from the ResNeXt101 32X8D and InceptionV3 backbones at a single level via feature concatenation, which enhances its effectiveness and stability. Specifically, the federated averaging (FedAvg) technique fosters collaborative model training across multiple hospitals while safeguarding sensitive patient information, ensuring that each participating hospital can contribute to the development of the model without sharing raw data. The proposed model was trained on a dataset of 10,002 images that included both healthy and cancerous oral tissues. Rigorous training and evaluation were conducted in both Independent and Identically Distributed (IID) and Independent and Non-Identically Distributed (Non-IID) settings. FedFusionNet demonstrated superior performance compared with pre-trained and some custom models for oral cancer diagnosis. This scalable and secure framework has profound implications for healthcare analytics. The study is a proof-of-concept demonstration that utilizes publicly available data to establish the technical feasibility of the FedFusionNet framework. Future deployment in actual collaborative environments would demonstrate its security-by-design capabilities across hospitals, where patient data confidentiality is a priority.
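Since the abstract names FedAvg explicitly, a minimal sketch of that aggregation step is given below; the local client training loop is omitted, and the `client_states`/`client_sizes` names and the use of raw state dicts are illustrative assumptions.

```python
import copy
import torch

def fedavg(client_states, client_sizes):
    """Sample-count-weighted average of client model weights (FedAvg).

    `client_states` is a list of state_dicts returned by hospitals after
    local training; `client_sizes` are their local sample counts. Only the
    aggregation step is shown.
    """
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state  # load into the global model with load_state_dict()
```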
{"title":"FedFusionNet: Advancing oral cancer recurrence prediction through federated fusion modeling","authors":"Al Rafi Aurnob , Sharia Arfin Tanim , Tahmid Enam Shrestha , M.F. Mridha , Durjoy Mistry","doi":"10.1016/j.inffus.2026.104205","DOIUrl":"10.1016/j.inffus.2026.104205","url":null,"abstract":"<div><div>Oral cancer represents a considerable global medical problem that requires the development of new technologies that offer reliable advanced therapies. This study introduced FedFusionNet, a fusion-centric model that was meticulously developed to advance early oral cancer diagnosis while preserving data privacy. The primary objective was to develop a model using federated learning (FL) to train across diverse healthcare facilities globally without compromising patient data confidentiality. This model uses features from the ResNeXt101 32X8D and InceptionV3 models to implement a single-level fusion via feature concatenation. This helps to enhance the effectiveness and stability of the model. Specifically, the federated averaging (FedAvg) technique fosters collaborative model training across multiple hospitals while safeguarding sensitive patient information. This ensured that each participating hospital could contribute to the development of the model without sharing the raw data. The proposed model was trained on a dataset of 10,002 images that included both healthy and cancerous oral tissues. Rigorous training and evaluation were conducted for both Independent and Identically Distributed (IID) and Independent and Non-Identically Distributed (Non-IID) settings. FedFusionNet demonstrated superior performance compared with pre-trained and some custom models for oral cancer diagnosis. This scalable and secure framework has profound implications for healthcare analytics. It is a proof-of-concept demonstration that utilizes publicly available data to establish the technical feasibility of the FedFusionNet framework. Future deployment in actual collaborative environments would demonstrate its security-by-design capabilities across hospitals, where patient data confidentiality is a priority.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104205"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-02-07 | DOI: 10.1016/j.inffus.2026.104211
Xiaoying Huang , Haonan Cheng , Sanyi Zhang , Xiaoxuan Guo , Long Ye
Music recommendation, as a core task of smart speakers, has an important impact on user experience in terms of recommendation speed and accuracy. However, existing music recommendation algorithms face challenges in generating adaptive playlists tailored to the user’s current state. This is primarily because achieving high recommendation accuracy typically necessitates substantial computing overheads. In addition, most existing music recommendation algorithms ignore smooth transitions between tracks, which further hurts the quality of the recommendations. To tackle these issues, we propose a novel Lightweight Music Recommendation (LMR) method via Multi-Physiological feature Fusion (MPF), which can be effectively applied in embedded smart speaker systems. Specifically, our proposed LMR method contains two core modules: an MPF-based music mapping module and a global-local similarity computation (GLSC)-based playlist recommendation module. The lightweight MPF-based music mapping model is designed to solve the track-user adaptation problem. Furthermore, we propose a GLSC-based playlist recommendation algorithm to address incoherence and unsmooth transitions within track sequences. Experiments demonstrate that the proposed method achieves more consistent playlist recommendations aligned with user contextual information, while also enabling smoother transitions between tracks and ensuring long-term content consistency across the entire sequence. Compared with other methods, our approach achieves a favorable balance between accuracy and efficiency.
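The GLSC algorithm itself is not spelled out in the abstract, so the sketch below only illustrates the general idea of blending a global term (fit to the user's current state) with a local term (smooth transition from the previous track); the weighting `alpha` and all embeddings are assumptions.

```python
import numpy as np

def cosine(a, b, eps=1e-9):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def score_track(candidate, user_profile, last_track, alpha=0.7):
    """Blend a global similarity (to the user's state-conditioned profile)
    with a local similarity (to the previous track) for playlist continuity.
    `alpha` and the embeddings are illustrative, not the paper's GLSC."""
    return alpha * cosine(candidate, user_profile) + (1 - alpha) * cosine(candidate, last_track)

# Toy usage: pick the next track greedily from a candidate pool.
rng = np.random.default_rng(0)
pool = rng.normal(size=(5, 32))
profile, last = rng.normal(size=32), rng.normal(size=32)
next_idx = max(range(len(pool)), key=lambda i: score_track(pool[i], profile, last))
```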
{"title":"Lightweight music recommendation via multi-physiological feature fusion","authors":"Xiaoying Huang , Haonan Cheng , Sanyi Zhang , Xiaoxuan Guo , Long Ye","doi":"10.1016/j.inffus.2026.104211","DOIUrl":"10.1016/j.inffus.2026.104211","url":null,"abstract":"<div><div>Music recommendation, as the core task of smart speakers, have an important impact on user experience in terms of recommendation speed and accuracy. However, existing music recommendation algorithms face challenges in generating adaptive playlists tailored to the user’s current state. This is primarily because achieving high recommendation accuracy typically necessitates substantial computing overheads. In addition, most of the existing music recommendation algorithms ignore smooth transitions between tracks, which further hurts the quality of the recommendations. To tackle these issues, we propose a novel Lightweight Music Recommendation (LMR) method via Multi-Physiological feature Fusion (MPF), which can be effectively applied in embedded smart speaker systems. Specifically, our proposed LMR method contains two core modules: a MPF-based music mapping module and a global-local similarity computation (GLSC) based playlist recommendation module. The lightweight MPF-based music mapping model is designed to solve the track-user adaptation problem. Furthermore, we propose a GLSC-based playlist recommendation algorithm to address the incoherence and unsmooth transitions within track sequences. Experiments demonstrate that the proposed method achieves more consistent playlist recommendations aligned with user contextual information, while also enabling smoother transitions between tracks and ensuring long-term content consistency across the entire sequence. Compared with other methods, our approach achieves a favorable balance between accuracy and efficiency.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104211"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | DOI: 10.1016/j.inffus.2026.104207
Maxim Markitantov , Elena Ryumina , Anastasia Dvoynikova , Alexey Karpov
Affective state recognition is a challenging task that requires a large amount of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor generalization across corpora, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both hierarchical uni-modal and fused multi-modal feature levels, enhancing intra- and inter-modal relationships of different affective states and modalities. Additionally, to optimize the performance of the proposed approach, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion, with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach showed a mean weighted F1-score (mWF) of 88.6% for emotion recognition and a weighted F1-score (WF) of 84.8% for sentiment recognition (respectively +9.5% and +6.0% absolute over prior multi-task baselines). On the Test subset of the MELD corpus, the proposed approach showed WF of 49.6% for emotion and 60.0% for sentiment (+8.4% WF for emotion recognition over the strongest multi-task baseline). On the Test subset of the RAMAS corpus, the proposed approach showed competitive performance with WF of 71.8% and 90.0% for emotion and sentiment, respectively. We compare the performance of the proposed approach with that of state-of-the-art methods. The source code and a demo of the developed approach are publicly available at https://smil-spcras.github.io/MASAI/.
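A minimal, assumption-laden sketch of a three-stage fusion with multi-task heads is given below: uni-modal features are fused pairwise, the pairwise features are fused again, and both levels feed joint emotion and sentiment heads. Plain linear layers stand in for the paper's attention-based Triple Fusion, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TripleFusionSketch(nn.Module):
    """Toy three-stage fusion with multi-task emotion/sentiment heads."""

    def __init__(self, d=128, n_emotions=7, n_sentiments=3):
        super().__init__()
        self.pair = nn.Linear(2 * d, d)    # stage 2: fuse a pair of modalities
        self.joint = nn.Linear(3 * d, d)   # stage 3: fuse the three pair features
        self.emotion_head = nn.Linear(4 * d, n_emotions)
        self.sentiment_head = nn.Linear(4 * d, n_sentiments)

    def forward(self, a, v, t):  # audio, video, text features, each (B, d)
        av = torch.relu(self.pair(torch.cat([a, v], dim=-1)))
        at = torch.relu(self.pair(torch.cat([a, t], dim=-1)))
        vt = torch.relu(self.pair(torch.cat([v, t], dim=-1)))
        joint = torch.relu(self.joint(torch.cat([av, at, vt], dim=-1)))
        final = torch.cat([a, v, t, joint], dim=-1)  # uni-modal + fused levels
        return self.emotion_head(final), self.sentiment_head(final)
```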
{"title":"Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion","authors":"Maxim Markitantov , Elena Ryumina , Anastasia Dvoynikova , Alexey Karpov","doi":"10.1016/j.inffus.2026.104207","DOIUrl":"10.1016/j.inffus.2026.104207","url":null,"abstract":"<div><div>Affective states recognition is a challenging task that requires a large amount of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor generalization across corpora, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both hierarchical uni-modal and fused multi-modal feature levels, enhancing intra- and inter-modal relationships of different affective states and modalities. Additionally, to optimize performance of the approach proposed, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion strategies with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach showed mean weighted F1-score (mWF) of 88.6% for emotion recognition and weighted F1-score (WF) of 84.8% for sentiment recognition (respectively +9.5% and +6.0% absolute over prior multi-task baselines). On the Test subset of the MELD corpus, the proposed approach showed WF of 49.6% for emotion and 60.0% for sentiment (+8.4% WF for emotion recognition over the strongest multi-task baseline). On the Test subset of the RAMAS corpus, the proposed approach showed a competitive performance with WF of 71.8% and 90.0% for emotion and sentiment, respectively. We compare the performance of the approach proposed with that of the state-of-the-art ones. The source code and demo of the developed approach is publicly available at <span><span>https://smil-spcras.github.io/MASAI/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104207"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146134528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-08-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.inffus.2026.104196
Yue Zhao , Xinning Chen , Kehan Li , Yifan Lin , Yang Liu , Fan Wang
Accurate 3D dental model segmentation is critical for digital dental treatment, as it provides valuable clinical references. Existing methods fail to adaptively evaluate the importance or contribution of different geometric attributes during heterogeneous feature fusion, hindering the accuracy of end-to-end segmentation. In this paper, we pioneer the description of the geometric attributes of a 3D dental model as views. A multi-view geometry-adaptive fusion network (MGAFNet) is proposed to dynamically seek the optimal combination of views through the exploration of distinctive and sharable features for fine-grained 3D dental model segmentation. Specifically, during distinctive feature extraction, we design a geometry-aware enhancement module (GAE) to improve the learning of topological variations in teeth. After that, a multivariate sharable cross-interaction module (SCIM) is developed to facilitate the flow of information and capture sharable features among views. Subsequently, a multivariate adaptive representation fusion module (MARF) is implemented to adaptively balance the importance or contribution of views by constructing weight matrices for distinctive and sharable features from different feature sources. Compared to eight advanced methods, our MGAFNet achieves state-of-the-art performance on both a public benchmark and a private clinical dataset. It demonstrates robustness in handling various dental conditions (e.g., misaligned, missing and supernumerary teeth), avoiding category confusion and blurry boundary segmentation.
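To illustrate the adaptive view-weighting idea behind MARF, the sketch below gates per-view point features with learned softmax weights before summing them; the gating network, tensor shapes, and the simple weighted sum are stand-ins, not the module's actual construction.

```python
import torch
import torch.nn as nn

class AdaptiveViewFusion(nn.Module):
    """Toy adaptive multi-view fusion: a small gating network scores each view
    (e.g. coordinates, normals, curvature) per point, and views are combined
    by their softmax weights. Dimensions are illustrative."""

    def __init__(self, d=64, n_views=3):
        super().__init__()
        self.gate = nn.Linear(n_views * d, n_views)

    def forward(self, views):                 # views: (B, N_points, n_views, d)
        b, n, k, d = views.shape
        w = torch.softmax(self.gate(views.reshape(b, n, k * d)), dim=-1)  # (B, N, k)
        return (w.unsqueeze(-1) * views).sum(dim=2)                       # (B, N, d)
```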
{"title":"Sharable and discriminative multi-view geometry-adaptive fusion network for 3D dental model segmentation","authors":"Yue Zhao , Xinning Chen , Kehan Li , Yifan Lin , Yang Liu , Fan Wang","doi":"10.1016/j.inffus.2026.104196","DOIUrl":"10.1016/j.inffus.2026.104196","url":null,"abstract":"<div><div>Accurate 3D dental model segmentation is critical for digital dental treatment, as it provides valuable clinical references. <em>Existing methods</em> fail to adaptively evaluate the importance or contribution of different geometric attributes during heterogeneous features fusion, hindering the accuracy of end-to-end segmentation. <em>In this paper</em>, we pioneer the description of geometric attributes of 3D dental model as views. A multi-view geometry-adaptive fusion network (MGAFNet) is proposed to dynamically seek the optimal combination of views through distinctive and sharable features exploration for fine-grained 3D dental model segmentation. <em>Specifically</em>, during distinctive features extraction, we design geometry-aware enhancement module (GAE) to improve topological variations learning in teeth. After that, a multivariate sharable cross-interaction module (SCIM) is developed to facilitate the flow of information and capture sharable features among views. Subsequently, a multivariate adaptive representation fusion module (MARF) is implemented to adaptively balance the importance or contribution of views by constructing weight matrices for distinctive and sharable features from different feature sources. Compared to eight advanced methods, our MGAFNet achieves state-of-the-art performance on both a public benchmark and a private clinical dataset. It demonstrates robustness in handling various dental conditions (e.g., misaligned, missing and supernumerary teeth), avoiding category confusion and blurry boundary segmentation.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"132 ","pages":"Article 104196"},"PeriodicalIF":15.5,"publicationDate":"2026-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146072490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01 | Epub Date: 2026-01-23 | DOI: 10.1016/j.inffus.2026.104130
Xilai Li , Wuyang Liu , Xiaosong Li , Fuqiang Zhou , Huafeng Li , Feiping Nie
Multi-modality image fusion (MMIF) combines complementary information from different image modalities to provide a comprehensive and objective interpretation of scenes. However, existing fusion methods cannot resist diverse weather interference in real-world scenes, limiting their practical applicability. To bridge this gap, we propose an end-to-end, unified all-weather MMIF model. Rather than focusing solely on pixel-level recovery, our method emphasizes maximizing the representation of key scene information through joint feature fusion and restoration. Specifically, we first decompose images into low-rank and sparse components, enabling effective feature separation for enhanced multi-modality perception. During feature recovery, we introduce a physically-aware clear feature prediction module, inferring variations in light transmission via illumination and reflectance. Clear features generated by the network are used to enhance the representation of salient information. We also construct a large-scale MMIF dataset of 100,000 image pairs that comprehensively covers rain, haze, and snow conditions, as well as various degradation levels and diverse scenes. Experimental results in both real-world and synthetic scenes demonstrate that the proposed method excels in image fusion and downstream tasks such as object detection, semantic segmentation, and depth estimation. The source code is available at https://github.com/ixilai/AWFusion.
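As a rough stand-in for the low-rank/sparse decomposition mentioned above, the sketch below keeps a truncated-SVD reconstruction as the low-rank layer and soft-thresholds the residual as the sparse layer; the paper learns this separation end-to-end, so the rank and threshold here are arbitrary illustrative values.

```python
import numpy as np

def lowrank_sparse_split(img, rank=8, thresh=0.05):
    """Heuristic low-rank + sparse separation of a single-channel image:
    top-`rank` singular components form the low-rank layer, and the residual
    is soft-thresholded into the sparse layer. Illustration only."""
    u, s, vt = np.linalg.svd(img.astype(np.float64), full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    residual = img - low_rank
    sparse = np.sign(residual) * np.maximum(np.abs(residual) - thresh, 0.0)
    return low_rank, sparse

# Usage on a grayscale image normalised to [0, 1]:
# L, S = lowrank_sparse_split(gray_image, rank=16, thresh=0.02)
```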
{"title":"All-weather multi-modality image fusion: Unified framework and 100k benchmark","authors":"Xilai Li , Wuyang Liu , Xiaosong Li , Fuqiang Zhou , Huafeng Li , Feiping Nie","doi":"10.1016/j.inffus.2026.104130","DOIUrl":"10.1016/j.inffus.2026.104130","url":null,"abstract":"<div><div>Multi-modality image fusion (MMIF) combines complementary information from different image modalities to provide a comprehensive and objective interpretation of scenes. However, existing fusion methods cannot resist diverse weather interference in real-world scenes, limiting their practical applicability. To bridge this gap, we propose an end-to-end, unified all-weather MMIF model. Rather than focusing solely on pixel-level recovery, our method emphasizes maximizing the representation of key scene information through joint feature fusion and restoration. Specifically, we first decompose images into low-rank and sparse components, enabling effective feature separation for enhanced multi-modality perception. During feature recovery, we introduce a physically-aware clear feature prediction module, inferring variations in light transmission via illumination and reflectance. Clear features generated by the network are used to enhance the representation of salient information. We also construct a large-scale MMIF dataset with 100,000 image pairs comprehensively across rain, haze, and snow conditions, as well as covering various degradation levels and diverse scenes. Experimental results in both real-world and synthetic scenes demonstrate that the proposed method excels in image fusion and downstream tasks such as object detection, semantic segmentation, and depth estimation. The source code is available at <span><span>https://github.com/ixilai/AWFusion</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"131 ","pages":"Article 104130"},"PeriodicalIF":15.5,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}