Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

Hao Tang, Zechao Li, Dong Zhang, Shengfeng He, Jinhui Tang
{"title":"Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection","authors":"Hao Tang;Zechao Li;Dong Zhang;Shengfeng He;Jinhui Tang","doi":"10.1109/TPAMI.2024.3511621","DOIUrl":null,"url":null,"abstract":"RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal performance in complex scenarios. Inspired by hierarchical human visual systems, we propose the <sc>ConTriNet</small>, a robust Confluent Triple-Flow Network employing a <italic>“Divide-and-Conquer”</i> strategy. This framework utilizes a unified encoder with specialized decoders, each addressing different subtasks of exploring modality-specific and modality-complementary information for RGB-T SOD, thereby enhancing the final saliency map prediction. Specifically, <sc>ConTriNet</small> comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. <sc>ConTriNet</small> presents several notable advantages. It incorporates a <italic>Modality-induced Feature Modulator (MFM)</i> in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational <italic>Residual Atrous Spatial Pyramid Module (RASPM)</i> in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a <italic>Modality-aware Dynamic Aggregation Module (MDAM)</i> in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a <italic>flow-cooperative fusion strategy</i>, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, <bold>VT-IMAG</b>, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that <sc>ConTriNet</small> consistently outperforms state-of-the-art competitors in both common and challenging scenarios, even when dealing with incomplete modality data.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1958-1974"},"PeriodicalIF":18.6000,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10778650/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal performance in complex scenarios. Inspired by hierarchical human visual systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network employing a “Divide-and-Conquer” strategy. This framework utilizes a unified encoder with specialized decoders, each addressing different subtasks of exploring modality-specific and modality-complementary information for RGB-T SOD, thereby enhancing the final saliency map prediction. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator (MFM) in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module (RASPM) in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module (MDAM) in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios, even when dealing with incomplete modality data.
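
To make the triple-flow layout described above more concrete, the sketch below outlines one plausible forward pass in PyTorch. It is a minimal illustration only: the module names follow the abstract (MFM, RASPM, MDAM, flow-cooperative fusion), but their internals, the backbone, and all channel sizes are placeholders assumed for this example and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a confluent triple-flow layout, assuming 3-channel
# RGB and thermal inputs of the same spatial size. MFM/RASPM/MDAM below are
# simple stand-ins for the modules named in the abstract, not the paper's design.
import torch
import torch.nn as nn


class ConTriNetSketch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Modality-shared union encoder (placeholder for a real backbone).
        self.shared_encoder = nn.Conv2d(3, channels, 3, padding=1)
        # Stand-in for the Modality-induced Feature Modulator (MFM).
        self.mfm = nn.Conv2d(channels * 2, channels * 2, 1)
        # Dilated blocks standing in for the Residual Atrous Spatial Pyramid Module (RASPM).
        self.raspm_rgb = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.raspm_thermal = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        # Stand-in for the Modality-aware Dynamic Aggregation Module (MDAM).
        self.mdam = nn.Conv2d(channels * 2, channels, 1)
        # Per-flow saliency heads plus a fusion head for the final prediction.
        self.head_rgb = nn.Conv2d(channels, 1, 1)
        self.head_thermal = nn.Conv2d(channels, 1, 1)
        self.head_comp = nn.Conv2d(channels, 1, 1)
        self.head_fused = nn.Conv2d(3, 1, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        # 1) Shared encoding of both modalities, then cross-modality modulation.
        f_rgb = self.shared_encoder(rgb)
        f_thermal = self.shared_encoder(thermal)
        modulated = self.mfm(torch.cat([f_rgb, f_thermal], dim=1))
        f_rgb, f_thermal = modulated.chunk(2, dim=1)
        # 2) Two modality-specific flows produce their own saliency maps.
        s_rgb = self.head_rgb(self.raspm_rgb(f_rgb))
        s_thermal = self.head_thermal(self.raspm_thermal(f_thermal))
        # 3) Modality-complementary flow aggregates cues from both flows.
        s_comp = self.head_comp(self.mdam(torch.cat([f_rgb, f_thermal], dim=1)))
        # 4) Flow-cooperative fusion of the three maps into the final prediction.
        return self.head_fused(torch.cat([s_rgb, s_thermal, s_comp], dim=1))


if __name__ == "__main__":
    model = ConTriNetSketch()
    rgb = torch.randn(1, 3, 256, 256)
    thermal = torch.randn(1, 3, 256, 256)
    print(model(rgb, thermal).shape)  # torch.Size([1, 1, 256, 256])
```

Under these assumptions, the module maps an aligned RGB-thermal pair to a single-channel saliency map; the paper additionally supervises the per-flow maps and produces a full-resolution output, which this toy sketch does not model.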