
Latest publications from Information Fusion

Integrating visual and audio cues for emotion and gender recognition: A multi modal and multi task approach
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-18 · DOI: 10.1016/j.inffus.2025.104071
Giuseppe De Simone , Luca Greco , Alessia Saggese , Mario Vento
Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduce MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.
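For readers unfamiliar with GradNorm, the sketch below shows, in simplified form, how task weights for the emotion and gender heads can be balanced by the norms of their gradients at the last shared layer. It is an illustration of the general GradNorm recipe, not the authors' implementation: a single shared linear layer stands in for MAGNET's soft-shared backbone, and the feature dimension, asymmetry parameter, and weight-update step are assumptions.

```python
# Simplified GradNorm-style balancing for two task heads (emotion + gender).
# Not the authors' code: dimensions, alpha and the update step are assumed.
import torch
import torch.nn as nn

class TwoTaskHead(nn.Module):
    def __init__(self, dim=128, n_emotions=6):
        super().__init__()
        self.shared = nn.Linear(dim, dim)            # last shared layer
        self.emotion = nn.Linear(dim, n_emotions)    # task 1 head
        self.gender = nn.Linear(dim, 2)              # task 2 head

    def forward(self, x):
        h = torch.relu(self.shared(x))
        return self.emotion(h), self.gender(h)

model, ce = TwoTaskHead(), nn.CrossEntropyLoss()
task_w = torch.ones(2, requires_grad=True)           # learnable task weights
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha, initial = 1.5, None                            # GradNorm asymmetry

for step in range(50):
    x = torch.randn(32, 128)                          # stand-in for fused audio-visual features
    y_emo, y_gen = torch.randint(0, 6, (32,)), torch.randint(0, 2, (32,))
    out_emo, out_gen = model(x)
    losses = torch.stack([ce(out_emo, y_emo), ce(out_gen, y_gen)])
    if initial is None:
        initial = losses.detach()
    weighted = (task_w * losses).sum()

    # Per-task gradient norms at the last shared layer (differentiable in task_w).
    gnorms = torch.stack([
        torch.autograd.grad(task_w[i] * losses[i], model.shared.weight,
                            retain_graph=True, create_graph=True)[0].norm()
        for i in range(2)])
    with torch.no_grad():                             # target norms from relative training rates
        ratio = losses.detach() / initial
        target = gnorms.mean() * (ratio / ratio.mean()) ** alpha
    gradnorm_loss = (gnorms - target).abs().sum()
    w_grad = torch.autograd.grad(gradnorm_loss, task_w, retain_graph=True)[0]

    opt.zero_grad()
    weighted.backward()                               # network update uses the weighted losses
    opt.step()
    with torch.no_grad():                             # weights move only with the GradNorm objective
        task_w -= 0.025 * w_grad
        task_w.clamp_(min=1e-3)
        task_w *= 2.0 / task_w.sum()                  # renormalise so weights sum to #tasks
```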
{"title":"Integrating visual and audio cues for emotion and gender recognition: A multi modal and multi task approach","authors":"Giuseppe De Simone ,&nbsp;Luca Greco ,&nbsp;Alessia Saggese ,&nbsp;Mario Vento","doi":"10.1016/j.inffus.2025.104071","DOIUrl":"10.1016/j.inffus.2025.104071","url":null,"abstract":"<div><div>Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduces MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104071"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Span-aware temporal aggregation network for video moment retrieval
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-18 · DOI: 10.1016/j.inffus.2025.104075
Xingyu Shen , Jinshi Xiao , Xiang Zhang , Long Lan , Xinwang Liu
Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporally invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) a Span-aware Feature Aggregation (SFA) module, which constructs span-specific visual representations that are aligned with the query to generate span-aware features, which are then integrated into local candidate moments; (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.
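The following toy sketch illustrates the underlying idea of span-specific representations: pool frame features over candidate moments of several lengths and score each pooled feature against the query embedding. It is not the SFA or QMR module from the paper; the feature dimensions, span lengths, and cosine-similarity scoring are illustrative assumptions.

```python
# Toy span-aware retrieval: pool frame features over candidate spans of
# different lengths and rank them by similarity to the query embedding.
import torch
import torch.nn.functional as F

T, d = 64, 256                        # number of frames, feature dim (assumed)
video = torch.randn(T, d)             # per-frame visual features
query = torch.randn(d)                # sentence-level query embedding

candidates, feats = [], []
for span_len in (4, 8, 16, 32):       # a few candidate span lengths
    for start in range(0, T - span_len + 1, span_len // 2):
        end = start + span_len
        candidates.append((start, end))
        feats.append(video[start:end].mean(dim=0))   # span-pooled feature

feats = torch.stack(feats)                            # (num_candidates, d)
scores = F.cosine_similarity(feats, query.unsqueeze(0), dim=-1)
best = scores.argmax().item()
print("best candidate span (frames):", candidates[best])
```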
{"title":"Span-aware temporal aggregation network for video moment retrieval","authors":"Xingyu Shen ,&nbsp;Jinshi Xiao ,&nbsp;Xiang Zhang ,&nbsp;Long Lan ,&nbsp;Xinwang Liu","doi":"10.1016/j.inffus.2025.104075","DOIUrl":"10.1016/j.inffus.2025.104075","url":null,"abstract":"<div><div>Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporal invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) A span-aware feature aggregation (SFA) module constructs span-specific visual representations that are aligned with the query to generate span-aware features, which are then integrated into local candidate moments; (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104075"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
GCEPANet: A lightweight and efficient remote sensing image cloud removal network model for optical-SAR image fusion
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-27 · DOI: 10.1016/j.inffus.2025.104090
Qinglong Zhou , Xing Wang , Jiahao Fang , Wenbo Wu , Bingxian Zhang
To mitigate severe cloud interference in optical remote sensing imagery and address the challenges of deploying complex cloud removal models on satellite platforms, this study proposes a lightweight gated parallel attention network, GCEPANet. By integrating optical and SAR data, the network fully exploits the penetration capability of SAR imagery and combines a Gated Convolution Module (GCONV) with an Enhanced Parallel Attention Module (EPA) to establish a “cloud perception–cloud refinement” cooperative mechanism. This mechanism enables the model to identify and filter features according to cloud intensity, effectively separating the feature flows of clear and cloudy regions, and adaptively compensating for cloud-induced degradation to reconstruct the true structural and radiative characteristics of surface objects. Furthermore, a joint spectral–structural loss is introduced to simultaneously constrain spectral consistency and structural fidelity. Extensive experiments on the SEN12MS-CR dataset demonstrate that the proposed GCEPANet consistently outperforms existing methods across multiple metrics, including PSNR, SSIM, MAE, RMSE, SAM, and ERGAS. Compared with the SCTCR model, GCEPANet achieves a 0.9306 dB improvement in PSNR, reduces the number of parameters by 85.5% (to 12.77M), and decreases FLOPs by 76.0% (to 9.71G). These results demonstrate that the proposed method achieves superior cloud removal performance while significantly reducing model complexity, providing an efficient and practical solution for real-time on-orbit cloud removal in optical–SAR fused remote sensing imagery.
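The gating idea behind modules like GCONV is commonly implemented as a feature branch modulated by a sigmoid gate branch; the hedged sketch below shows that generic form applied to concatenated optical and SAR channels. The channel counts and the tanh/sigmoid pairing are assumptions, not the paper's exact design.

```python
# Generic gated convolution in the "feature * sigmoid(gate)" form, applied
# to concatenated optical + SAR inputs. Illustrative only; not GCONV/EPA.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        # The gate acts as a soft mask, e.g. down-weighting cloud-covered pixels.
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

optical = torch.randn(1, 3, 128, 128)   # RGB optical patch
sar = torch.randn(1, 1, 128, 128)       # co-registered SAR patch
block = GatedConv2d(in_ch=4, out_ch=32)
fused = block(torch.cat([optical, sar], dim=1))
print(fused.shape)                       # torch.Size([1, 32, 128, 128])
```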
{"title":"GCEPANet: A lightweight and efficient remote sensing image cloud removal network model for optical-SAR image fusion","authors":"Qinglong Zhou ,&nbsp;Xing Wang ,&nbsp;Jiahao Fang ,&nbsp;Wenbo Wu ,&nbsp;Bingxian Zhang","doi":"10.1016/j.inffus.2025.104090","DOIUrl":"10.1016/j.inffus.2025.104090","url":null,"abstract":"<div><div>To mitigate severe cloud interference in optical remote sensing imagery and address the challenges of deploying complex cloud removal models on satellite platforms, this study proposes a lightweight gated parallel attention network, GCEPANet. By integrating optical and SAR data, the network fully exploits the penetration capability of SAR imagery and combines a Gated Convolution Module (GCONV) with an Enhanced Parallel Attention Module (EPA) to establish a “cloud perception–cloud refinement” cooperative mechanism. This mechanism enables the model to identify and filter features according to cloud intensity, effectively separating the feature flows of clear and cloudy regions, and adaptively compensating for cloud-induced degradation to reconstruct the true structural and radiative characteristics of surface objects. Furthermore, a joint spectral–structural loss is introduced to simultaneously constrain spectral consistency and structural fidelity. Extensive experiments on the SEN12MS-CR dataset demonstrate that the proposed GCEPANet consistently outperforms existing methods across multiple metrics, including PSNR, SSIM, MAE, RMSE, SAM, and ERGAS. Compared with the SCTCR model, GCEPANet achieves a 0.9306 dB improvement in PSNR, reduces the number of parameters by 85.5% (to 12.77M), and decreases FLOPs by 76.0% (to 9.71G). These results demonstrate that the proposed method achieves superior cloud removal performance while significantly reducing model complexity, providing an efficient and practical solution for real-time on-orbit cloud removal in optical–SAR fused remote sensing imagery.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104090"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145845111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
VLDBench: Evaluating multimodal disinformation with regulatory alignment
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-24 · DOI: 10.1016/j.inffus.2025.104092
Shaina Raza , Ashmal Vayani , Aditya Jain , Aravind Narayanan , Vahid Reza Khazaie , Syed Raza Bashir , Elham Dolatabadi , Gias Uddin , Christos Emmanouilidis , Rizwan Qureshi , Mubarak Shah
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI-safety benchmarks focus on single-modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. In this work, we introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluation of state-of-the-art LLMs and VLMs on VLDBench shows that adding visual cues improves detection accuracy, with gains ranging from 5 points for strong baselines (e.g., LLaMA-3.2-11B-Vision 74.82% vs. LLaMA-3.2-1B-Instruct 70.29%) to 25-30 points for smaller families (e.g., LLaVA-v1.5-Vicuna7B 72.32% vs. Vicuna-7B-v1.5 55.21%), reflecting complementary evidence from images (e.g., meme-like visuals, image-text consistency) that text alone cannot capture. We provide data and code for evaluation, fine-tuning and robustness tests to support disinformation analysis. Developed in alignment with the AI Governance frameworks (MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media.
Project:https://vectorinstitute.github.io/VLDBench/
Data:https://huggingface.co/datasets/vector-institute/VLDBench
Code:https://github.com/VectorInstitute/VLDBench
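As a hedged illustration of how the released resources above might be used, the sketch below loads the Hugging Face dataset and scores a trivial text-only heuristic. The split name, the "text" and "label" columns, and the 0/1 label convention are assumptions; consult the dataset card before relying on them.

```python
# Hedged sketch: pull the benchmark from the Hugging Face Hub and score a
# trivial text-only detector. Split/column names are assumptions.
from datasets import load_dataset

ds = load_dataset("vector-institute/VLDBench", split="train")
sample = ds.select(range(100))                      # small slice for the demo

def predict(example):
    # Placeholder for a real LLM/VLM call: a crude clickbait-keyword heuristic.
    text = str(example.get("text", "")).lower()
    return 1 if ("shocking" in text or "exposed" in text) else 0

labels = [example.get("label") for example in sample]
preds = [predict(example) for example in sample]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(sample)
print(f"toy accuracy on {len(sample)} samples: {accuracy:.2f}")
```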
{"title":"VLDBench Evaluating multimodal disinformation with regulatory alignment","authors":"Shaina Raza ,&nbsp;Ashmal Vayani ,&nbsp;Aditya Jain ,&nbsp;Aravind Narayanan ,&nbsp;Vahid Reza Khazaie ,&nbsp;Syed Raza Bashir ,&nbsp;Elham Dolatabadi ,&nbsp;Gias Uddin ,&nbsp;Christos Emmanouilidis ,&nbsp;Rizwan Qureshi ,&nbsp;Mubarak Shah","doi":"10.1016/j.inffus.2025.104092","DOIUrl":"10.1016/j.inffus.2025.104092","url":null,"abstract":"<div><div>Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI-safety benchmarks focus on single-modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news; remains largely unaddressed. In this work, we introduce the <strong>V</strong>ision-<strong>L</strong>anguage <strong>D</strong>isinformation Detection <strong>Bench</strong>mark (<span><strong>VLDBench</strong></span>), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. <span><strong>VLDBench</strong></span> comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluation of state-of-the-art LLMs and VLMs on <span><strong>VLDBench</strong></span> shows that adding visual cues improves detection accuracy, with gains ranging from 5 points for strong baselines (e.g., LLaMA-3.2-11B-Vision 74.82% vs. LLaMA-3.2-1B-Instruct 70.29%) to 25-30 points for smaller families (e.g., LLaVA-v1.5-Vicuna7B 72.32% vs. Vicuna-7B-v1.5 55.21%), reflecting complementary evidence from images (e.g., meme-like visuals, image-text consistency) that text alone cannot capture. We provide data and code for evaluation, fine-tuning and robustness tests to support disinformation analysis. 
Developed in alignment with the AI Goverance frameworks (MIT AI Risk Repository), <span><strong>VLDBench</strong></span> offers a principled foundation for advancing trustworthy disinformation detection in multimodal media.<span><div><div><table><tbody><tr><th><figure><img></figure></th><td><strong>Project:</strong></td><td><span><span>https://vectorinstitute.github.io/VLDBench/</span><svg><path></path></svg></span></td></tr><tr><th><figure><img></figure></th><td><strong>Data:</strong></td><td><span><span>https://huggingface.co/datasets/vector-institute/VLDBench</span><svg><path></path></svg></span></td></tr><tr><th><figure><img></figure></th><td><strong>Code:</strong></td><td><span><span>https://github.com/VectorInstitute/VLDBench</span><svg><path></path></svg></span></td></tr></tbody></table></div></div></span></div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104092"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145822968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A hierarchical information policy fusion framework with multimodal large language models for autonomous guidewire navigation in endovascular procedures
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2026-01-03 · DOI: 10.1016/j.inffus.2025.104115
Haoyu Wang , Taylor Yiu , Serena Lee , Ka Gao , Hangling Sun , Chenyu Zhou , Anji Li , Qiangqiang Fu , Yu Wang , Bin Chen
Robotic-assisted endovascular interventions promise to transform cardiovascular therapy by improving procedural precision and minimizing cardiologists’ exposure to occupational risks. However, current systems are limited by their reliance on manual control and lack of adaptability to complex vascular anatomies. To address these challenges, we propose a novel Hierarchical Autonomous Guidewire Navigation and Delivery (HAG-ND) framework that leverages the strengths of multimodal large language models (MLLMs) and a novel reinforcement learning module inspired by Deep Q-Networks (DQNs). The high-level MLLM is trained on diverse blood vessel and guidewire scenarios from various angles and positions, enabling it to assess the suitability and timing of substance release at the target location. Within the MLLM, a parliamentary mechanism is introduced, where multiple specialized models, each focusing on a specific aspect of the vascular environment, vote on the optimal course of action. The low-level reinforcement learning module focuses on optimizing autonomous guidewire navigation to the designated target site by learning from the rich semantic understanding provided by the MLLM. Experimental evaluations demonstrate that the HAG-ND framework significantly improves the accuracy and reliability of guidewire positioning and targeted delivery compared to existing methods. By harnessing the complementary capabilities of MLLMs and novel reinforcement learning techniques in a hierarchical architecture, HAG-ND represents a significant step towards fully autonomous and adaptive robotic-assisted endovascular interventions.
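The "parliamentary mechanism" can be pictured as a majority vote over specialist policies, as in the toy sketch below. The action names, observation fields, and specialist rules are invented for illustration and do not reflect HAG-ND's actual interface.

```python
# Toy "parliamentary" voting: several specialist models each propose an
# action and the majority wins. Names and rules are hypothetical.
from collections import Counter
from typing import Callable, Dict, List

Action = str  # e.g. "advance", "retract", "rotate", "release"

def parliament_vote(specialists: List[Callable[[Dict], Action]], obs: Dict) -> Action:
    votes = Counter(spec(obs) for spec in specialists)
    return votes.most_common(1)[0][0]

# Hypothetical specialists, each attending to one aspect of the vessel scene.
vessel_geometry = lambda obs: "rotate" if obs["branch_angle"] > 60 else "advance"
wire_tip_state = lambda obs: "retract" if obs["tip_buckled"] else "advance"
target_checker = lambda obs: "release" if obs["at_target"] else "advance"

obs = {"branch_angle": 30, "tip_buckled": False, "at_target": False}
print(parliament_vote([vessel_geometry, wire_tip_state, target_checker], obs))  # advance
```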
{"title":"A hierarchical information policy fusion framework with multimodal large language models for autonomous guidewire navigation in endovascular procedures","authors":"Haoyu Wang ,&nbsp;Taylor Yiu ,&nbsp;Serena Lee ,&nbsp;Ka Gao ,&nbsp;Hangling Sun ,&nbsp;Chenyu Zhou ,&nbsp;Anji Li ,&nbsp;Qiangqiang Fu ,&nbsp;Yu Wang ,&nbsp;Bin Chen","doi":"10.1016/j.inffus.2025.104115","DOIUrl":"10.1016/j.inffus.2025.104115","url":null,"abstract":"<div><div>Robotic-assisted endovascular interventions promise to transform cardiovascular therapy by improving procedural precision and minimizing cardiologists’ exposure to occupational risks. However, current systems are limited by their reliance on manual control and lack of adaptability to complex vascular anatomies. To address these challenges, we propose a novel <em><strong>H</strong></em>ierarchical <em><strong>A</strong></em>utonomous <em><strong>G</strong></em>uidewire <em><strong>N</strong></em>avigation and <em><strong>D</strong></em>elivery (<em><strong>HAG-ND</strong></em>) framework that leverages the strengths of multimodal large language models (MLLMs) and a novel reinforcement learning module inspired by Deep Q-Networks (DQNs). The high-level MLLM is trained on diverse blood vessel and guidewire scenarios from various angles and positions, enabling it to assess the suitability and timing of substance release at the target location. Within the MLLM, a parliamentary mechanism is introduced, where multiple specialized models, each focusing on a specific aspect of the vascular environment, vote on the optimal course of action. The low-level reinforcement learning module focuses on optimizing autonomous guidewire navigation to the designated target site by learning from the rich semantic understanding provided by the MLLM. Experimental evaluations demonstrate that the HAG-ND framework significantly improves the accuracy and reliability of guidewire positioning and targeted delivery compared to existing methods. By harnessing the complementary capabilities of MLLMs and novel reinforcement learning techniques in a hierarchical architecture, HAG-ND represents a significant step towards fully autonomous and adaptive robotic-assisted endovascular interventions.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104115"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DAK-Pose: Dual-augmentor knowledge fusion for generalizable video-based 3D human pose estimation
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2026-01-03 · DOI: 10.1016/j.inffus.2025.104100
Yachuan Wang, Bin Zhang, Hao Yuan
Real-world deployment of video-based 3D human pose estimation remains challenging, as limited annotated data collected in constrained lab settings cannot fully capture the complexity of human motion. While motion synthesis for data augmentation has emerged as a mainstream solution to enhance generalization, existing synthesis methods suffer from inherent trade-offs: kinematics-based motion synthesis approaches preserve anatomical plausibility but sacrifice temporal coherence, while coordinate-based methods ensure motion smoothness but violate biomechanical constraints. This results in persistent domain gaps when synthetic data is directly used in the observation space to train pose estimation models. To overcome this, we propose DAK-Pose, which shifts augmentation to the feature space. We disentangle motion into structural and dynamic features, and design two complementary augmentors: (1) A structure-prioritized module enforces kinematic constraints for anatomical validity, and (2) a dynamic-prioritized module generates diverse temporal patterns. Auxiliary encoders trained on synthetic motions generated by these augmentors transfer domain-invariant knowledge to the pose estimator through adversarial alignment. Experiments on Human3.6M, MPI-INF-3DHP, and 3DPW datasets show that DAK-Pose achieves state-of-the-art cross-dataset performance.
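Adversarial alignment of real and augmentor-generated features is often realized with a gradient-reversal layer; the sketch below shows that standard construction, not the paper's specific dual-augmentor training loop. The keypoint dimensionality (17 joints in 2D) and the head sizes are assumptions.

```python
# Standard gradient-reversal layer (GRL) for adversarial feature alignment.
# Illustrative stand-in, not DAK-Pose's actual training pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

encoder = nn.Linear(17 * 2, 128)           # flattened 2D pose -> feature (dims assumed)
domain_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

poses = torch.randn(16, 34)                # batch of flattened 2D keypoints
domain = torch.randint(0, 2, (16,))        # 0 = real motion, 1 = augmentor-synthesised motion
features = encoder(poses)
logits = domain_head(GradReverse.apply(features, 1.0))
adv_loss = F.cross_entropy(logits, domain)
adv_loss.backward()                        # encoder is pushed to fool the domain classifier
print(adv_loss.item())
```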
{"title":"DAK-Pose: Dual-augmentor knowledge fusion for generalizable video-based 3D human pose estimation","authors":"Yachuan Wang,&nbsp;Bin Zhang,&nbsp;Hao Yuan","doi":"10.1016/j.inffus.2025.104100","DOIUrl":"10.1016/j.inffus.2025.104100","url":null,"abstract":"<div><div>Real-world deployment of video-based 3D human pose estimation remains challenging, as limited annotated data collected in constrained lab settings cannot fully capture the complexity of human motion. While motion synthesis for data augmentation has emerged as a mainstream solution to enhance generalization, existing synthesis methods suffer from inherent trade-offs: kinematics-based motion synthesis approaches preserve anatomical plausibility but sacrifice temporal coherence, while coordinate-based methods ensure motion smoothness but violate biomechanical constraints. This results in persistent domain gaps when synthetic data is directly used in the observation space to train pose estimation models. To overcome this, we propose DAK-Pose, which shifts augmentation to the feature space. We disentangle motion into structural and dynamic features, and design two complementary augmentors: (1) A structure-prioritized module enforces kinematic constraints for anatomical validity, and (2) a dynamic-prioritized module generates diverse temporal patterns. Auxiliary encoders trained on synthetic motions generated by these augmentors transfer domain-invariant knowledge to the pose estimator through adversarial alignment. Experiments on Human3.6M, MPI-INF-3DHP, and 3DPW datasets show that DAK-Pose achieves state-of-the-art cross-dataset performance.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104100"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A comprehensive survey and taxonomy of mamba: Applications, Challenges, and Future Directions
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-22 · DOI: 10.1016/j.inffus.2025.104094
Qiguang Miao , Linxing Jia , Kun Xie , Kaiyuan Fu , Zongkai Yang
Transformer-based architectures have achieved notable success across natural language processing, computer vision, and multimodal learning, yet they face persistent challenges such as high computational complexity and limited adaptability to dynamic environments. State Space Models (SSMs) have emerged as a competitive alternative, offering linear-time complexity and the ability to implicitly capture long-range dependencies. Building on this foundation, the Mamba model introduces time-varying parameterization to dynamically adjust state transitions based on input context, combined with selective state updates, content-aware scanning strategies, and hardware-efficient design. These innovations enable Mamba to maintain linear complexity while delivering higher throughput and significantly reduced memory consumption compared to both Transformer-based and conventional SSM architectures. This survey systematically reviews the theoretical foundations, architectural innovations, and application progress of the Mamba model. First, we trace the evolution of SSMs, highlighting the key design principles that underpin Mamba’s dynamic state transition and selective computation mechanisms. Second, we summarize Mamba’s structural innovations in modeling dynamics and multimodal fusion, categorizing its applications across multiple modalities, including vision, speech, point clouds, and multimodal data. Finally, we evaluate representative applications in medical image analysis, recommendation systems, reinforcement learning, and generative modeling, identifying advantages, limitations, and open challenges. The review concludes by outlining future research directions focused on improving generalization, causal reasoning, interpretability, and computational efficiency. This work aims to provide a concise yet comprehensive reference for researchers and practitioners, promoting further development and deployment of Mamba-based architectures across diverse real-world scenarios.
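The recurrence at the heart of SSM layers can be written as h_t = Ā h_{t-1} + B̄ x_t, y_t = C h_t, obtained by discretizing a continuous state-space model; the numpy sketch below runs that recurrence with fixed parameters. Mamba's selectivity, i.e. making B, C and the step size functions of the input, is noted in a comment but not implemented, and all sizes are assumed.

```python
# Minimal numpy sketch of the discrete state-space recurrence behind SSM
# layers: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t. Mamba additionally
# makes B, C and the discretisation step input-dependent (selective updates);
# here they stay fixed for clarity.
import numpy as np

N, T = 4, 10                                       # state size, sequence length
A = -np.diag(np.arange(1.0, N + 1))                # stable diagonal continuous-time dynamics
B = np.ones((N, 1))
C = np.random.randn(1, N)
dt = 0.1

# Zero-order-hold discretisation: A_bar = exp(dt A), B_bar = A^{-1}(A_bar - I) B
A_bar = np.diag(np.exp(dt * np.diag(A)))
B_bar = np.linalg.inv(A) @ (A_bar - np.eye(N)) @ B

x = np.random.randn(T)                             # scalar input sequence
h = np.zeros((N, 1))
outputs = []
for t in range(T):                                 # linear-time scan over the sequence
    h = A_bar @ h + B_bar * x[t]
    outputs.append((C @ h).item())
print(np.round(outputs, 3))
```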
{"title":"A comprehensive survey and taxonomy of mamba: Applications, Challenges, and Future Directions","authors":"Qiguang Miao ,&nbsp;Linxing Jia ,&nbsp;Kun Xie ,&nbsp;Kaiyuan Fu ,&nbsp;Zongkai Yang","doi":"10.1016/j.inffus.2025.104094","DOIUrl":"10.1016/j.inffus.2025.104094","url":null,"abstract":"<div><div>Transformer-based architectures have achieved notable success across natural language processing, computer vision, and multimodal learning, yet they face persistent challenges such as high computational complexity and limited adaptability to dynamic environments. State Space Models (SSMs) have emerged as a competitive alternative, offering linear-time complexity and the ability to implicitly capture long-range dependencies. Building on this foundation, the Mamba model introduces time-varying parameterization to dynamically adjust state transitions based on input context, combined with selective state updates, content-aware scanning strategies, and hardware-efficient design. These innovations enable Mamba to maintain linear complexity while delivering higher throughput and significantly reduced memory consumption compared to both Transformer-based and conventional SSM architectures. This survey systematically reviews the theoretical foundations, architectural innovations, and application progress of the Mamba model. First, we trace the evolution of SSMs, highlighting the key design principles that underpin Mamba’s dynamic state transition and selective computation mechanisms. Second, we summarize Mamba’s structural innovations in modeling dynamics and multimodal fusion, categorizing its applications across multiple modalities, including vision, speech, point clouds, and multimodal data. Finally, we evaluate representative applications in medical image analysis, recommendation systems, reinforcement learning, and generative modeling, identifying advantages, limitations, and open challenges. The review concludes by outlining future research directions focused on improving generalization, causal reasoning, interpretability, and computational efficiency. This work aims to provide a concise yet comprehensive reference for researchers and practitioners, promoting further development and deployment of Mamba-based architectures across diverse real-world scenarios.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104094"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145813863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SG-DGLF: A similarity-guided dual-graph learning framework
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2026-01-07 · DOI: 10.1016/j.inffus.2026.104127
Menglin Yu , Shuxia Lu , Jiacheng Cong
Graph neural networks (GNNs) perform exceptionally well in node classification, but they face severe challenges when dealing with imbalanced node classification. On the one hand, the model is prone to overfitting due to the small number of minority-class samples. The GNN message-passing mechanism amplifies this problem, causing the model to overfit specific features and local neighborhood structures of minority-class nodes rather than learning general patterns, resulting in poor generalization ability. On the other hand, the scarcity of samples leads to high variance in model training. Model performance is highly dependent on specific training samples and local graph structures, and is extremely sensitive to data partitioning, ultimately resulting in severe performance fluctuations and unstable results. In this work, to address the issues of minority-class overfitting and high model variance faced by GNNs in imbalanced scenarios, we propose a Similarity-Guided Dual-Graph Learning Framework (SG-DGLF). To address the problem of overfitting for minority classes, the framework introduces a similarity-based dynamic-threshold random capture mechanism, which supplements minority-class samples by generating pseudo labels. Second, we leverage graph diffusion-based propagation and a random edge-dropping strategy to create new graphs, thereby increasing node diversity to alleviate the problem of excessive model variance. Empirically, SG-DGLF significantly outperforms advanced baseline methods on multiple imbalanced datasets. This validates the effectiveness of our framework in mitigating the problems of overfitting minority classes and high model variance.
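A minimal sketch of similarity-guided pseudo-labelling is given below: unlabelled nodes whose embeddings are sufficiently close to the minority-class prototype receive pseudo labels. The prototype construction and the data-dependent threshold are stand-ins for the paper's dynamic-threshold random capture mechanism, not a reproduction of it.

```python
# Similarity-thresholded pseudo-labelling for a minority class (illustrative).
import torch
import torch.nn.functional as F

num_nodes, dim = 200, 64
emb = torch.randn(num_nodes, dim)                 # node embeddings from a GNN
labels = torch.full((num_nodes,), -1)             # -1 = unlabelled
minority_idx = torch.arange(0, 10)                # few labelled minority nodes
labels[minority_idx] = 1

prototype = F.normalize(emb[minority_idx].mean(0), dim=0)
sims = F.normalize(emb, dim=1) @ prototype        # cosine similarity to the prototype

threshold = sims[minority_idx].mean()             # data-dependent ("dynamic") threshold
pseudo = (labels == -1) & (sims > threshold)
labels[pseudo] = 1                                 # supplement the minority class
print("pseudo-labelled minority nodes:", int(pseudo.sum()))
```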
{"title":"SG-DGLF: A similarity-guided dual-graph learning framework","authors":"Menglin Yu ,&nbsp;Shuxia Lu ,&nbsp;Jiacheng Cong","doi":"10.1016/j.inffus.2026.104127","DOIUrl":"10.1016/j.inffus.2026.104127","url":null,"abstract":"<div><div>Graph neural networks (GNNs) perform exceptionally well in node classification, but graph neural networks face severe challenges when dealing with imbalanced node classification. On the one hand, the model is prone to overfitting due to the small number of minority class samples. GNN’s message passing mechanism amplifies this problem, causing the model to overfit specific features and local neighborhood structures of minority class nodes rather than learning general patterns, resulting in poor generalization ability. On the other hand, the scarcity of samples leads to high variance in model training. Model performance is highly dependent on specific training samples and local graph structures, and is extremely sensitive to data partitioning, ultimately resulting in severe performance fluctuations and unstable results. In this work, to address the issues of minority class overfitting and high model variance faced by GNNs in imbalanced scenarios, we propose the dual-graph framework, A similarity-Guided Dual-Graph Learning Framework (SG-DGLF). To address the problem of overfitting for minority classes, the framework introduces a dynamic threshold random capture mechanism based on similarity, which supplements minority class samples by generating pseudo labels. Secondly, we leverage graph diffusion-based propagation and random edge dropping strategy to create new graphs, thereby increasing node diversity to alleviate the problem of excessive model variance. Empirically, SG-DGLF significantly outperforms advanced baseline methods on multiple imbalanced datasets. This validates the effectiveness of our framework in mitigating the problems of overfitting minority classes and high model variance.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104127"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145939897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fusing time- and frequency-domain information for effort-independent lung function evaluation using oscillometry
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2026-01-11 · DOI: 10.1016/j.inffus.2026.104147
Sunxiaohe Li , Dongfang Zhao , Zirui Wang , Hao Zhang , Pang Wu , Zhenfeng Li , Lidong Du , Xianxiang Chen , Hongtao Niu , Xiaopan Li , Jingen Xia , Ting Yang , Peng Wang , Zhen Fang
Current methods for evaluating lung function require substantial patient cooperation and rigorous quality control. In contrast, impulse oscillometry (IOS) is a promising alternative that can measure lung mechanics with minimal patient effort and operational ease. IOS applies pressure oscillations to the airways and analyzes the resulting signals. However, previous studies on IOS have been limited to frequency-domain features derived from its response signals, while neglecting valuable time-domain information. To bridge this gap, we developed a deep learning model that fuses time- and frequency-domain IOS data for lung function evaluation. An internal dataset (2,702 cases) and an external dataset (335 cases) were retrospectively collected for model training and validation. Model performance was first evaluated through ablation studies and then tested across different demographic subgroups. Finally, Grad-CAM was employed to improve model interpretability. Results showed that our model accurately predicted lung function parameters, including FEV1/FVC (mean absolute errors [MAEs] of 3.78 and 4.33 %), FEV1 (MAEs of 0.235 and 0.270 L), and FVC (MAEs of 0.264 and 0.315 L), in internal and external validation sets. The model also demonstrated strong performance in respiratory disease prescreening, achieving AUCs of 0.989 and 0.980 with sensitivities of 73.97 % and 71.47 % for detecting airway obstruction, and AUCs of 0.938 and 0.925 with sensitivities of 76.41 % and 66.24 % for classifying four ventilation patterns across the two sets. By fusing time- and frequency-domain IOS data, this study offers a new strategy for pulmonary function evaluation, facilitating more efficient prescreening for pulmonary diseases.
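To make the time/frequency fusion concrete, the sketch below extracts FFT magnitudes at typical oscillometry frequencies together with simple time-domain statistics and concatenates them into one feature vector. The sampling rate, frequency set, and chosen statistics are assumptions and do not reflect the paper's preprocessing.

```python
# Illustrative time+frequency feature fusion from a single oscillometry-like
# signal; signal names, frequencies and statistics are assumed.
import numpy as np

fs = 200.0                                   # sampling rate (Hz), assumed
t = np.arange(0, 30, 1 / fs)                 # 30 s recording
pressure = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(t.size)

# Frequency-domain features: spectrum magnitude at typical IOS frequencies.
spectrum = np.fft.rfft(pressure)
freqs = np.fft.rfftfreq(pressure.size, 1 / fs)
ios_freqs = [5, 11, 19, 25, 35]
freq_feats = [np.abs(spectrum[np.argmin(np.abs(freqs - f))]) for f in ios_freqs]

# Time-domain features: simple summary statistics of the raw signal.
time_feats = [pressure.mean(), pressure.std(), np.ptp(pressure)]

fused = np.concatenate([freq_feats, time_feats])   # input to, e.g., an MLP regressor
print(fused.shape)                                  # (8,)
```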
{"title":"Fusing time- and frequency-domain information for effort-independent lung function evaluation using oscillometry","authors":"Sunxiaohe Li ,&nbsp;Dongfang Zhao ,&nbsp;Zirui Wang ,&nbsp;Hao Zhang ,&nbsp;Pang Wu ,&nbsp;Zhenfeng Li ,&nbsp;Lidong Du ,&nbsp;Xianxiang Chen ,&nbsp;Hongtao Niu ,&nbsp;Xiaopan Li ,&nbsp;Jingen Xia ,&nbsp;Ting Yang ,&nbsp;Peng Wang ,&nbsp;Zhen Fang","doi":"10.1016/j.inffus.2026.104147","DOIUrl":"10.1016/j.inffus.2026.104147","url":null,"abstract":"<div><div>Current methods for evaluating lung function require substantial patient cooperation and rigorous quality control. In contrast, impulse oscillometry (IOS) is a promising alternative that can measure lung mechanics with minimal patient effort and operational ease. IOS applies pressure oscillations to the airways and analyzes the resulting signals. However, previous studies on IOS have been limited to frequency-domain features derived from its response signals, while neglecting valuable time-domain information. To bridge this gap, we developed a deep learning model that fuses time- and frequency-domain IOS data for lung function evaluation. An internal dataset (2,702 cases) and an external dataset (335 cases) were retrospectively collected for model training and validation. Model performance was first evaluated through ablation studies and then tested across different demographic subgroups. Finally, Grad-CAM was employed to improve model interpretability. Results showed that our model accurately predicted lung function parameters, including FEV<sub>1</sub>/FVC (mean absolute errors [MAEs] of 3.78 and 4.33 %), FEV<sub>1</sub> (MAEs of 0.235 and 0.270 L), and FVC (MAEs of 0.264 and 0.315 L), in internal and external validation sets. The model also demonstrated strong performance in respiratory disease prescreening, achieving AUCs of 0.989 and 0.980 with sensitivities of 73.97 % and 71.47 % for detecting airway obstruction, and AUCs of 0.938 and 0.925 with sensitivities of 76.41 % and 66.24 % for classifying four ventilation patterns across the two sets. By fusing time- and frequency-domain IOS data, this study offers a new strategy for pulmonary function evaluation, facilitating more efficient prescreening for pulmonary diseases.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104147"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dimensional compensation for small-sample and small-size insulator burn mark via RGB-point cloud fusion in power grid inspection
IF 15.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-06-01 · Epub Date: 2025-12-30 · DOI: 10.1016/j.inffus.2025.104105
Junqiu Tang , Zhikang Yuan , Zixiang Wei , Shuojie Gao , Changyong Shen
To address the challenge of scarce burn mark samples in power infrastructure inspection, we introduce the Insulator Burn Mark RGB-Point Cloud (IBMR) dataset, the first publicly available benchmark featuring RGB-point clouds with pixel-level annotations for both insulators and burn marks. To tackle the critical issue of severe class imbalance caused by the vast number of background points and the small size of burn marks, we propose a novel two-stage RGB-point cloud segmentation framework. This framework integrates DCCU-Sampling, an innovative downsampling algorithm that effectively suppresses background points while preserving critical structures of the targets, and BB-Backtracking, a geometric recovery method that reconstructs fine-grained burn mark details lost during downsampling process. Experimental results validate the framework’s effectiveness, achieving 81.21% mIoU with 32 training samples and 68.37% mIoU with only 14 samples. The dataset is publicly available at https://huggingface.co/datasets/Junqiu-Tang/IBMR.
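The benefit of target-preserving downsampling can be illustrated with the hedged sketch below, which keeps all labelled burn-mark points and randomly thins the background. This is not DCCU-Sampling itself, only a simple stand-in showing how such a step counters the background/target imbalance.

```python
# Target-preserving downsampling: keep every foreground (burn-mark) point,
# randomly subsample the dominant background. Illustrative stand-in only.
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100_000, 6))             # x, y, z, r, g, b
labels = np.zeros(100_000, dtype=int)         # 0 = background
labels[:300] = 1                               # tiny burn-mark region

def downsample(points, labels, keep_background=0.05):
    fg = labels != 0
    bg_idx = np.flatnonzero(~fg)
    keep_bg = rng.choice(bg_idx, size=int(keep_background * bg_idx.size), replace=False)
    keep = np.concatenate([np.flatnonzero(fg), keep_bg])
    return points[keep], labels[keep]

pts_ds, lbl_ds = downsample(points, labels)
print(pts_ds.shape, "burn-mark fraction:", (lbl_ds == 1).mean())
```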
{"title":"Dimensional compensation for small-sample and small-size insulator burn mark via RGB-point cloud fusion in power grid inspection","authors":"Junqiu Tang ,&nbsp;Zhikang Yuan ,&nbsp;Zixiang Wei ,&nbsp;Shuojie Gao ,&nbsp;Changyong Shen","doi":"10.1016/j.inffus.2025.104105","DOIUrl":"10.1016/j.inffus.2025.104105","url":null,"abstract":"<div><div>To address the challenge of scarce burn mark samples in power infrastructure inspection, we introduce the Insulator Burn Mark RGB-Point Cloud (IBMR) dataset, the first publicly available benchmark featuring RGB-point clouds with pixel-level annotations for both insulators and burn marks. To tackle the critical issue of severe class imbalance caused by the vast number of background points and the small size of burn marks, we propose a novel two-stage RGB-point cloud segmentation framework. This framework integrates DCCU-Sampling, an innovative downsampling algorithm that effectively suppresses background points while preserving critical structures of the targets, and BB-Backtracking, a geometric recovery method that reconstructs fine-grained burn mark details lost during downsampling process. Experimental results validate the framework’s effectiveness, achieving 81.21% mIoU with 32 training samples and 68.37% mIoU with only 14 samples. The dataset is publicly available at <span><span>https://huggingface.co/datasets/Junqiu-Tang/IBMR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104105"},"PeriodicalIF":15.5,"publicationDate":"2026-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0