RGB-T Tracking With Template-Bridged Search Interaction and Target-Preserved Template Updating

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2024-10-07 | DOI: 10.1109/TPAMI.2024.3475472

Bo Li;Fengguang Peng;Tianrui Hui;Xiaoming Wei;Xiaolin Wei;Lijun Zhang;Hang Shi;Si Liu

Volume 47, Issue 1, pp. 634-649 | Journal Article | https://ieeexplore.ieee.org/document/10706882/

Citations: 0

Abstract

RGB-Thermal (RGB-T) tracking aims to exploit the complementary strengths of the RGB and thermal infrared (TIR) modalities to improve tracking in diverse situations, and cross-modal interaction is a crucial element of it. Earlier methods often simply combine the features of the RGB and TIR search frames, which yields a coarse interaction and introduces unnecessary background noise. Many other approaches sample candidate boxes from search frames and apply different fusion techniques to individual pairs of RGB and TIR boxes, which confines cross-modal interaction to local areas and results in insufficient context modeling. Mining video temporal context is also under-explored in RGB-T tracking. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module that uses templates as the medium to bridge cross-modal interaction between the RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. An Illumination Guided Fusion (IGF) module adaptively fuses RGB and TIR search region tokens using a global illumination factor. Furthermore, at inference we propose an efficient Target-Preserved Template Updating (TPTU) strategy that leverages the temporal context within video sequences to accommodate changes in the target's appearance. The proposed modules are integrated into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
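The abstract gives no equations, so the following is only a rough sketch of what "gathering and distributing" context through templates, followed by illumination-weighted fusion, might look like. It is not the authors' implementation. Every function name, the single-head attention without learned projections, the order of the gather and distribute steps, and the convex-combination form of the fusion are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product cross-attention with a residual connection.

    Assumption: one head and no learned projections, to keep the sketch short.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        mixed = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def template_bridged_interaction(search_rgb, search_tir, template_rgb, template_tir):
    """Hypothetical bridge: templates gather context, then search regions read it."""
    # Gather: each template collects context from its own modality's search region.
    bridge_rgb = cross_attend(template_rgb, search_rgb)
    bridge_tir = cross_attend(template_tir, search_tir)
    # Distribute: each search region attends to the other modality's bridged template.
    new_rgb = cross_attend(search_rgb, bridge_tir)
    new_tir = cross_attend(search_tir, bridge_rgb)
    return new_rgb, new_tir


def illumination_guided_fusion(search_rgb, search_tir, illumination):
    """Assumed form: per-token convex combination weighted by one global scalar."""
    if not 0.0 <= illumination <= 1.0:
        raise ValueError("illumination must lie in [0, 1]")
    return [
        [illumination * r + (1.0 - illumination) * t for r, t in zip(rgb_tok, tir_tok)]
        for rgb_tok, tir_tok in zip(search_rgb, search_tir)
    ]


# Toy tokens: three search tokens and one template token per modality, dimension 2.
rgb_search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tir_search = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]
rgb_template = [[1.0, 1.0]]
tir_template = [[0.5, 0.0]]

new_rgb, new_tir = template_bridged_interaction(
    rgb_search, tir_search, rgb_template, tir_template
)
fused = illumination_guided_fusion(new_rgb, new_tir, illumination=0.7)
```

In this sketch the search regions never attend to each other directly. All cross-modal information passes through the few template tokens, which is one way to read the abstract's claim that bridging through templates avoids pulling in unnecessary background noise.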

