Pub Date: 2026-06-01. Epub Date: 2025-12-18. DOI: 10.1016/j.inffus.2025.104071
Giuseppe De Simone , Luca Greco , Alessia Saggese , Mario Vento
Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduce MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.
{"title":"Integrating visual and audio cues for emotion and gender recognition: A multi modal and multi task approach","authors":"Giuseppe De Simone , Luca Greco , Alessia Saggese , Mario Vento","doi":"10.1016/j.inffus.2025.104071","DOIUrl":"10.1016/j.inffus.2025.104071","url":null,"abstract":"<div><div>Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduces MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104071"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2025-12-18. DOI: 10.1016/j.inffus.2025.104075
Xingyu Shen , Jinshi Xiao , Xiang Zhang , Long Lan , Xinwang Liu
Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporally invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) a Span-aware Feature Aggregation (SFA) module, which constructs span-specific visual representations aligned with the query to generate span-aware features that are then integrated into local candidate moments; and (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.
{"title":"Span-aware temporal aggregation network for video moment retrieval","authors":"Xingyu Shen , Jinshi Xiao , Xiang Zhang , Long Lan , Xinwang Liu","doi":"10.1016/j.inffus.2025.104075","DOIUrl":"10.1016/j.inffus.2025.104075","url":null,"abstract":"<div><div>Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporal invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) A span-aware feature aggregation (SFA) module constructs span-specific visual representations that are aligned with the query to generate span-aware features, which are then integrated into local candidate moments; (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104075"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To mitigate severe cloud interference in optical remote sensing imagery and address the challenges of deploying complex cloud removal models on satellite platforms, this study proposes a lightweight gated parallel attention network, GCEPANet. By integrating optical and SAR data, the network fully exploits the penetration capability of SAR imagery and combines a Gated Convolution Module (GCONV) with an Enhanced Parallel Attention Module (EPA) to establish a “cloud perception–cloud refinement” cooperative mechanism. This mechanism enables the model to identify and filter features according to cloud intensity, effectively separating the feature flows of clear and cloudy regions, and adaptively compensating for cloud-induced degradation to reconstruct the true structural and radiative characteristics of surface objects. Furthermore, a joint spectral–structural loss is introduced to simultaneously constrain spectral consistency and structural fidelity. Extensive experiments on the SEN12MS-CR dataset demonstrate that the proposed GCEPANet consistently outperforms existing methods across multiple metrics, including PSNR, SSIM, MAE, RMSE, SAM, and ERGAS. Compared with the SCTCR model, GCEPANet achieves a 0.9306 dB improvement in PSNR, reduces the number of parameters by 85.5% (to 12.77M), and decreases FLOPs by 76.0% (to 9.71G). These results demonstrate that the proposed method achieves superior cloud removal performance while significantly reducing model complexity, providing an efficient and practical solution for real-time on-orbit cloud removal in optical–SAR fused remote sensing imagery.
{"title":"GCEPANet: A lightweight and efficient remote sensing image cloud removal network model for optical-SAR image fusion","authors":"Qinglong Zhou , Xing Wang , Jiahao Fang , Wenbo Wu , Bingxian Zhang","doi":"10.1016/j.inffus.2025.104090","DOIUrl":"10.1016/j.inffus.2025.104090","url":null,"abstract":"<div><div>To mitigate severe cloud interference in optical remote sensing imagery and address the challenges of deploying complex cloud removal models on satellite platforms, this study proposes a lightweight gated parallel attention network, GCEPANet. By integrating optical and SAR data, the network fully exploits the penetration capability of SAR imagery and combines a Gated Convolution Module (GCONV) with an Enhanced Parallel Attention Module (EPA) to establish a “cloud perception–cloud refinement” cooperative mechanism. This mechanism enables the model to identify and filter features according to cloud intensity, effectively separating the feature flows of clear and cloudy regions, and adaptively compensating for cloud-induced degradation to reconstruct the true structural and radiative characteristics of surface objects. Furthermore, a joint spectral–structural loss is introduced to simultaneously constrain spectral consistency and structural fidelity. Extensive experiments on the SEN12MS-CR dataset demonstrate that the proposed GCEPANet consistently outperforms existing methods across multiple metrics, including PSNR, SSIM, MAE, RMSE, SAM, and ERGAS. Compared with the SCTCR model, GCEPANet achieves a 0.9306 dB improvement in PSNR, reduces the number of parameters by 85.5% (to 12.77M), and decreases FLOPs by 76.0% (to 9.71G). These results demonstrate that the proposed method achieves superior cloud removal performance while significantly reducing model complexity, providing an efficient and practical solution for real-time on-orbit cloud removal in optical–SAR fused remote sensing imagery.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104090"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145845111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI-safety benchmarks focus on single-modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. In this work, we introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluation of state-of-the-art LLMs and VLMs on VLDBench shows that adding visual cues improves detection accuracy, with gains ranging from 5 points for strong baselines (e.g., LLaMA-3.2-11B-Vision 74.82% vs. LLaMA-3.2-1B-Instruct 70.29%) to 25-30 points for smaller families (e.g., LLaVA-v1.5-Vicuna7B 72.32% vs. Vicuna-7B-v1.5 55.21%), reflecting complementary evidence from images (e.g., meme-like visuals, image-text consistency) that text alone cannot capture. We provide data and code for evaluation, fine-tuning, and robustness tests to support disinformation analysis. Developed in alignment with AI governance frameworks (MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media.
{"title":"VLDBench Evaluating multimodal disinformation with regulatory alignment","authors":"Shaina Raza , Ashmal Vayani , Aditya Jain , Aravind Narayanan , Vahid Reza Khazaie , Syed Raza Bashir , Elham Dolatabadi , Gias Uddin , Christos Emmanouilidis , Rizwan Qureshi , Mubarak Shah","doi":"10.1016/j.inffus.2025.104092","DOIUrl":"10.1016/j.inffus.2025.104092","url":null,"abstract":"<div><div>Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI-safety benchmarks focus on single-modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news; remains largely unaddressed. In this work, we introduce the <strong>V</strong>ision-<strong>L</strong>anguage <strong>D</strong>isinformation Detection <strong>Bench</strong>mark (<span><strong>VLDBench</strong></span>), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. <span><strong>VLDBench</strong></span> comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluation of state-of-the-art LLMs and VLMs on <span><strong>VLDBench</strong></span> shows that adding visual cues improves detection accuracy, with gains ranging from 5 points for strong baselines (e.g., LLaMA-3.2-11B-Vision 74.82% vs. LLaMA-3.2-1B-Instruct 70.29%) to 25-30 points for smaller families (e.g., LLaVA-v1.5-Vicuna7B 72.32% vs. Vicuna-7B-v1.5 55.21%), reflecting complementary evidence from images (e.g., meme-like visuals, image-text consistency) that text alone cannot capture. We provide data and code for evaluation, fine-tuning and robustness tests to support disinformation analysis. Developed in alignment with the AI Goverance frameworks (MIT AI Risk Repository), <span><strong>VLDBench</strong></span> offers a principled foundation for advancing trustworthy disinformation detection in multimodal media.<span><div><div><table><tbody><tr><th><figure><img></figure></th><td><strong>Project:</strong></td><td><span><span>https://vectorinstitute.github.io/VLDBench/</span><svg><path></path></svg></span></td></tr><tr><th><figure><img></figure></th><td><strong>Data:</strong></td><td><span><span>https://huggingface.co/datasets/vector-institute/VLDBench</span><svg><path></path></svg></span></td></tr><tr><th><figure><img></figure></th><td><strong>Code:</strong></td><td><span><span>https://github.com/VectorInstitute/VLDBench</span><svg><path></path></svg></span></td></tr></tbody></table></div></div></span></div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104092"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145822968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2026-01-03. DOI: 10.1016/j.inffus.2025.104115
Haoyu Wang , Taylor Yiu , Serena Lee , Ka Gao , Hangling Sun , Chenyu Zhou , Anji Li , Qiangqiang Fu , Yu Wang , Bin Chen
Robotic-assisted endovascular interventions promise to transform cardiovascular therapy by improving procedural precision and minimizing cardiologists’ exposure to occupational risks. However, current systems are limited by their reliance on manual control and lack of adaptability to complex vascular anatomies. To address these challenges, we propose a novel Hierarchical Autonomous Guidewire Navigation and Delivery (HAG-ND) framework that leverages the strengths of multimodal large language models (MLLMs) and a novel reinforcement learning module inspired by Deep Q-Networks (DQNs). The high-level MLLM is trained on diverse blood vessel and guidewire scenarios from various angles and positions, enabling it to assess the suitability and timing of substance release at the target location. Within the MLLM, a parliamentary mechanism is introduced, where multiple specialized models, each focusing on a specific aspect of the vascular environment, vote on the optimal course of action. The low-level reinforcement learning module focuses on optimizing autonomous guidewire navigation to the designated target site by learning from the rich semantic understanding provided by the MLLM. Experimental evaluations demonstrate that the HAG-ND framework significantly improves the accuracy and reliability of guidewire positioning and targeted delivery compared to existing methods. By harnessing the complementary capabilities of MLLMs and novel reinforcement learning techniques in a hierarchical architecture, HAG-ND represents a significant step towards fully autonomous and adaptive robotic-assisted endovascular interventions.
{"title":"A hierarchical information policy fusion framework with multimodal large language models for autonomous guidewire navigation in endovascular procedures","authors":"Haoyu Wang , Taylor Yiu , Serena Lee , Ka Gao , Hangling Sun , Chenyu Zhou , Anji Li , Qiangqiang Fu , Yu Wang , Bin Chen","doi":"10.1016/j.inffus.2025.104115","DOIUrl":"10.1016/j.inffus.2025.104115","url":null,"abstract":"<div><div>Robotic-assisted endovascular interventions promise to transform cardiovascular therapy by improving procedural precision and minimizing cardiologists’ exposure to occupational risks. However, current systems are limited by their reliance on manual control and lack of adaptability to complex vascular anatomies. To address these challenges, we propose a novel <em><strong>H</strong></em>ierarchical <em><strong>A</strong></em>utonomous <em><strong>G</strong></em>uidewire <em><strong>N</strong></em>avigation and <em><strong>D</strong></em>elivery (<em><strong>HAG-ND</strong></em>) framework that leverages the strengths of multimodal large language models (MLLMs) and a novel reinforcement learning module inspired by Deep Q-Networks (DQNs). The high-level MLLM is trained on diverse blood vessel and guidewire scenarios from various angles and positions, enabling it to assess the suitability and timing of substance release at the target location. Within the MLLM, a parliamentary mechanism is introduced, where multiple specialized models, each focusing on a specific aspect of the vascular environment, vote on the optimal course of action. The low-level reinforcement learning module focuses on optimizing autonomous guidewire navigation to the designated target site by learning from the rich semantic understanding provided by the MLLM. Experimental evaluations demonstrate that the HAG-ND framework significantly improves the accuracy and reliability of guidewire positioning and targeted delivery compared to existing methods. By harnessing the complementary capabilities of MLLMs and novel reinforcement learning techniques in a hierarchical architecture, HAG-ND represents a significant step towards fully autonomous and adaptive robotic-assisted endovascular interventions.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104115"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2026-01-03. DOI: 10.1016/j.inffus.2025.104100
Yachuan Wang, Bin Zhang, Hao Yuan
Real-world deployment of video-based 3D human pose estimation remains challenging, as limited annotated data collected in constrained lab settings cannot fully capture the complexity of human motion. While motion synthesis for data augmentation has emerged as a mainstream solution to enhance generalization, existing synthesis methods suffer from inherent trade-offs: kinematics-based motion synthesis approaches preserve anatomical plausibility but sacrifice temporal coherence, while coordinate-based methods ensure motion smoothness but violate biomechanical constraints. This results in persistent domain gaps when synthetic data is directly used in the observation space to train pose estimation models. To overcome this, we propose DAK-Pose, which shifts augmentation to the feature space. We disentangle motion into structural and dynamic features, and design two complementary augmentors: (1) A structure-prioritized module enforces kinematic constraints for anatomical validity, and (2) a dynamic-prioritized module generates diverse temporal patterns. Auxiliary encoders trained on synthetic motions generated by these augmentors transfer domain-invariant knowledge to the pose estimator through adversarial alignment. Experiments on Human3.6M, MPI-INF-3DHP, and 3DPW datasets show that DAK-Pose achieves state-of-the-art cross-dataset performance.
{"title":"DAK-Pose: Dual-augmentor knowledge fusion for generalizable video-based 3D human pose estimation","authors":"Yachuan Wang, Bin Zhang, Hao Yuan","doi":"10.1016/j.inffus.2025.104100","DOIUrl":"10.1016/j.inffus.2025.104100","url":null,"abstract":"<div><div>Real-world deployment of video-based 3D human pose estimation remains challenging, as limited annotated data collected in constrained lab settings cannot fully capture the complexity of human motion. While motion synthesis for data augmentation has emerged as a mainstream solution to enhance generalization, existing synthesis methods suffer from inherent trade-offs: kinematics-based motion synthesis approaches preserve anatomical plausibility but sacrifice temporal coherence, while coordinate-based methods ensure motion smoothness but violate biomechanical constraints. This results in persistent domain gaps when synthetic data is directly used in the observation space to train pose estimation models. To overcome this, we propose DAK-Pose, which shifts augmentation to the feature space. We disentangle motion into structural and dynamic features, and design two complementary augmentors: (1) A structure-prioritized module enforces kinematic constraints for anatomical validity, and (2) a dynamic-prioritized module generates diverse temporal patterns. Auxiliary encoders trained on synthetic motions generated by these augmentors transfer domain-invariant knowledge to the pose estimator through adversarial alignment. Experiments on Human3.6M, MPI-INF-3DHP, and 3DPW datasets show that DAK-Pose achieves state-of-the-art cross-dataset performance.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104100"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2025-12-22. DOI: 10.1016/j.inffus.2025.104094
Qiguang Miao , Linxing Jia , Kun Xie , Kaiyuan Fu , Zongkai Yang
Transformer-based architectures have achieved notable success across natural language processing, computer vision, and multimodal learning, yet they face persistent challenges such as high computational complexity and limited adaptability to dynamic environments. State Space Models (SSMs) have emerged as a competitive alternative, offering linear-time complexity and the ability to implicitly capture long-range dependencies. Building on this foundation, the Mamba model introduces time-varying parameterization to dynamically adjust state transitions based on input context, combined with selective state updates, content-aware scanning strategies, and hardware-efficient design. These innovations enable Mamba to maintain linear complexity while delivering higher throughput and significantly reduced memory consumption compared to both Transformer-based and conventional SSM architectures. This survey systematically reviews the theoretical foundations, architectural innovations, and application progress of the Mamba model. First, we trace the evolution of SSMs, highlighting the key design principles that underpin Mamba’s dynamic state transition and selective computation mechanisms. Second, we summarize Mamba’s structural innovations in modeling dynamics and multimodal fusion, categorizing its applications across multiple modalities, including vision, speech, point clouds, and multimodal data. Finally, we evaluate representative applications in medical image analysis, recommendation systems, reinforcement learning, and generative modeling, identifying advantages, limitations, and open challenges. The review concludes by outlining future research directions focused on improving generalization, causal reasoning, interpretability, and computational efficiency. This work aims to provide a concise yet comprehensive reference for researchers and practitioners, promoting further development and deployment of Mamba-based architectures across diverse real-world scenarios.
{"title":"A comprehensive survey and taxonomy of mamba: Applications, Challenges, and Future Directions","authors":"Qiguang Miao , Linxing Jia , Kun Xie , Kaiyuan Fu , Zongkai Yang","doi":"10.1016/j.inffus.2025.104094","DOIUrl":"10.1016/j.inffus.2025.104094","url":null,"abstract":"<div><div>Transformer-based architectures have achieved notable success across natural language processing, computer vision, and multimodal learning, yet they face persistent challenges such as high computational complexity and limited adaptability to dynamic environments. State Space Models (SSMs) have emerged as a competitive alternative, offering linear-time complexity and the ability to implicitly capture long-range dependencies. Building on this foundation, the Mamba model introduces time-varying parameterization to dynamically adjust state transitions based on input context, combined with selective state updates, content-aware scanning strategies, and hardware-efficient design. These innovations enable Mamba to maintain linear complexity while delivering higher throughput and significantly reduced memory consumption compared to both Transformer-based and conventional SSM architectures. This survey systematically reviews the theoretical foundations, architectural innovations, and application progress of the Mamba model. First, we trace the evolution of SSMs, highlighting the key design principles that underpin Mamba’s dynamic state transition and selective computation mechanisms. Second, we summarize Mamba’s structural innovations in modeling dynamics and multimodal fusion, categorizing its applications across multiple modalities, including vision, speech, point clouds, and multimodal data. Finally, we evaluate representative applications in medical image analysis, recommendation systems, reinforcement learning, and generative modeling, identifying advantages, limitations, and open challenges. The review concludes by outlining future research directions focused on improving generalization, causal reasoning, interpretability, and computational efficiency. This work aims to provide a concise yet comprehensive reference for researchers and practitioners, promoting further development and deployment of Mamba-based architectures across diverse real-world scenarios.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104094"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145813863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2026-01-07. DOI: 10.1016/j.inffus.2026.104127
Menglin Yu , Shuxia Lu , Jiacheng Cong
Graph neural networks (GNNs) perform exceptionally well in node classification, but they face severe challenges when the node classes are imbalanced. On the one hand, the model is prone to overfitting because minority classes have few samples. The GNN message passing mechanism amplifies this problem, causing the model to overfit specific features and local neighborhood structures of minority-class nodes rather than learning general patterns, resulting in poor generalization. On the other hand, the scarcity of samples leads to high variance during training: model performance depends heavily on the specific training samples and local graph structures and is extremely sensitive to data partitioning, ultimately resulting in severe performance fluctuations and unstable results. In this work, to address minority-class overfitting and high model variance in imbalanced scenarios, we propose a Similarity-Guided Dual-Graph Learning Framework (SG-DGLF). To counter minority-class overfitting, the framework introduces a similarity-based dynamic-threshold random capture mechanism that supplements minority-class samples by generating pseudo labels. Second, we leverage graph diffusion-based propagation and a random edge dropping strategy to create new graphs, thereby increasing node diversity and alleviating excessive model variance. Empirically, SG-DGLF significantly outperforms advanced baseline methods on multiple imbalanced datasets, validating its effectiveness in mitigating minority-class overfitting and high model variance.
{"title":"SG-DGLF: A similarity-guided dual-graph learning framework","authors":"Menglin Yu , Shuxia Lu , Jiacheng Cong","doi":"10.1016/j.inffus.2026.104127","DOIUrl":"10.1016/j.inffus.2026.104127","url":null,"abstract":"<div><div>Graph neural networks (GNNs) perform exceptionally well in node classification, but graph neural networks face severe challenges when dealing with imbalanced node classification. On the one hand, the model is prone to overfitting due to the small number of minority class samples. GNN’s message passing mechanism amplifies this problem, causing the model to overfit specific features and local neighborhood structures of minority class nodes rather than learning general patterns, resulting in poor generalization ability. On the other hand, the scarcity of samples leads to high variance in model training. Model performance is highly dependent on specific training samples and local graph structures, and is extremely sensitive to data partitioning, ultimately resulting in severe performance fluctuations and unstable results. In this work, to address the issues of minority class overfitting and high model variance faced by GNNs in imbalanced scenarios, we propose the dual-graph framework, A similarity-Guided Dual-Graph Learning Framework (SG-DGLF). To address the problem of overfitting for minority classes, the framework introduces a dynamic threshold random capture mechanism based on similarity, which supplements minority class samples by generating pseudo labels. Secondly, we leverage graph diffusion-based propagation and random edge dropping strategy to create new graphs, thereby increasing node diversity to alleviate the problem of excessive model variance. Empirically, SG-DGLF significantly outperforms advanced baseline methods on multiple imbalanced datasets. This validates the effectiveness of our framework in mitigating the problems of overfitting minority classes and high model variance.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104127"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145939897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-06-01. Epub Date: 2026-01-11. DOI: 10.1016/j.inffus.2026.104147
Sunxiaohe Li , Dongfang Zhao , Zirui Wang , Hao Zhang , Pang Wu , Zhenfeng Li , Lidong Du , Xianxiang Chen , Hongtao Niu , Xiaopan Li , Jingen Xia , Ting Yang , Peng Wang , Zhen Fang
Current methods for evaluating lung function require substantial patient cooperation and rigorous quality control. In contrast, impulse oscillometry (IOS) is a promising alternative that can measure lung mechanics with minimal patient effort and operational ease. IOS applies pressure oscillations to the airways and analyzes the resulting signals. However, previous studies on IOS have been limited to frequency-domain features derived from its response signals, while neglecting valuable time-domain information. To bridge this gap, we developed a deep learning model that fuses time- and frequency-domain IOS data for lung function evaluation. An internal dataset (2,702 cases) and an external dataset (335 cases) were retrospectively collected for model training and validation. Model performance was first evaluated through ablation studies and then tested across different demographic subgroups. Finally, Grad-CAM was employed to improve model interpretability. Results showed that our model accurately predicted lung function parameters, including FEV1/FVC (mean absolute errors [MAEs] of 3.78 and 4.33 %), FEV1 (MAEs of 0.235 and 0.270 L), and FVC (MAEs of 0.264 and 0.315 L), in internal and external validation sets. The model also demonstrated strong performance in respiratory disease prescreening, achieving AUCs of 0.989 and 0.980 with sensitivities of 73.97 % and 71.47 % for detecting airway obstruction, and AUCs of 0.938 and 0.925 with sensitivities of 76.41 % and 66.24 % for classifying four ventilation patterns across the two sets. By fusing time- and frequency-domain IOS data, this study offers a new strategy for pulmonary function evaluation, facilitating more efficient prescreening for pulmonary diseases.
{"title":"Fusing time- and frequency-domain information for effort-independent lung function evaluation using oscillometry","authors":"Sunxiaohe Li , Dongfang Zhao , Zirui Wang , Hao Zhang , Pang Wu , Zhenfeng Li , Lidong Du , Xianxiang Chen , Hongtao Niu , Xiaopan Li , Jingen Xia , Ting Yang , Peng Wang , Zhen Fang","doi":"10.1016/j.inffus.2026.104147","DOIUrl":"10.1016/j.inffus.2026.104147","url":null,"abstract":"<div><div>Current methods for evaluating lung function require substantial patient cooperation and rigorous quality control. In contrast, impulse oscillometry (IOS) is a promising alternative that can measure lung mechanics with minimal patient effort and operational ease. IOS applies pressure oscillations to the airways and analyzes the resulting signals. However, previous studies on IOS have been limited to frequency-domain features derived from its response signals, while neglecting valuable time-domain information. To bridge this gap, we developed a deep learning model that fuses time- and frequency-domain IOS data for lung function evaluation. An internal dataset (2,702 cases) and an external dataset (335 cases) were retrospectively collected for model training and validation. Model performance was first evaluated through ablation studies and then tested across different demographic subgroups. Finally, Grad-CAM was employed to improve model interpretability. Results showed that our model accurately predicted lung function parameters, including FEV<sub>1</sub>/FVC (mean absolute errors [MAEs] of 3.78 and 4.33 %), FEV<sub>1</sub> (MAEs of 0.235 and 0.270 L), and FVC (MAEs of 0.264 and 0.315 L), in internal and external validation sets. The model also demonstrated strong performance in respiratory disease prescreening, achieving AUCs of 0.989 and 0.980 with sensitivities of 73.97 % and 71.47 % for detecting airway obstruction, and AUCs of 0.938 and 0.925 with sensitivities of 76.41 % and 66.24 % for classifying four ventilation patterns across the two sets. By fusing time- and frequency-domain IOS data, this study offers a new strategy for pulmonary function evaluation, facilitating more efficient prescreening for pulmonary diseases.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104147"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To address the challenge of scarce burn mark samples in power infrastructure inspection, we introduce the Insulator Burn Mark RGB-Point Cloud (IBMR) dataset, the first publicly available benchmark featuring RGB-point clouds with pixel-level annotations for both insulators and burn marks. To tackle the critical issue of severe class imbalance caused by the vast number of background points and the small size of burn marks, we propose a novel two-stage RGB-point cloud segmentation framework. This framework integrates DCCU-Sampling, an innovative downsampling algorithm that effectively suppresses background points while preserving critical target structures, and BB-Backtracking, a geometric recovery method that reconstructs fine-grained burn mark details lost during the downsampling process. Experimental results validate the framework’s effectiveness, achieving 81.21% mIoU with 32 training samples and 68.37% mIoU with only 14 samples. The dataset is publicly available at https://huggingface.co/datasets/Junqiu-Tang/IBMR.
{"title":"Dimensional compensation for small-sample and small-size insulator burn mark via RGB-point cloud fusion in power grid inspection","authors":"Junqiu Tang , Zhikang Yuan , Zixiang Wei , Shuojie Gao , Changyong Shen","doi":"10.1016/j.inffus.2025.104105","DOIUrl":"10.1016/j.inffus.2025.104105","url":null,"abstract":"<div><div>To address the challenge of scarce burn mark samples in power infrastructure inspection, we introduce the Insulator Burn Mark RGB-Point Cloud (IBMR) dataset, the first publicly available benchmark featuring RGB-point clouds with pixel-level annotations for both insulators and burn marks. To tackle the critical issue of severe class imbalance caused by the vast number of background points and the small size of burn marks, we propose a novel two-stage RGB-point cloud segmentation framework. This framework integrates DCCU-Sampling, an innovative downsampling algorithm that effectively suppresses background points while preserving critical structures of the targets, and BB-Backtracking, a geometric recovery method that reconstructs fine-grained burn mark details lost during downsampling process. Experimental results validate the framework’s effectiveness, achieving 81.21% mIoU with 32 training samples and 68.37% mIoU with only 14 samples. The dataset is publicly available at <span><span>https://huggingface.co/datasets/Junqiu-Tang/IBMR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104105"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}