Exploring Multi-Modal Spatial–Temporal Contexts for High-Performance RGB-T Tracking

IF 13.7 | IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society | Pub Date: 2024-07-19 | DOI: 10.1109/TIP.2024.3428316

Tianlu Zhang;Qiang Jiao;Qiang Zhang;Jungong Han

IEEE Transactions on Image Processing, vol. 33, pp. 4303-4318. Publication type: Journal Article. https://ieeexplore.ieee.org/document/10605602/

Citations: 0

Abstract

In RGB-T tracking, multi-modal data contain rich spatial relationships between the target and its background, and these spatial relationships remain largely consistent across successive frames. Both properties are crucial for boosting tracking performance. However, most existing RGB-T trackers overlook these multi-modal spatial relationships and temporal consistencies, which hinders robust tracking and practical application in complex scenarios. In this paper, we propose a novel Multi-modal Spatial-Temporal Context (MMSTC) network for RGB-T tracking, which employs a Transformer architecture to construct reliable multi-modal spatial context information and to propagate temporal context information effectively. Specifically, a Multi-modal Transformer Encoder (MMTE) is designed to encode reliable multi-modal spatial contexts and to fuse multi-modal features. Furthermore, a Quality-aware Transformer Decoder (QATD) is proposed to propagate tracking cues from historical frames to the current frame, which facilitates the object searching process. Moreover, the proposed MMSTC network can be easily extended to various tracking frameworks. New state-of-the-art results on five prevalent RGB-T tracking benchmarks demonstrate the superiority of the proposed trackers over existing ones.
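The abstract does not specify how the MMTE is built, so the following is only a rough sketch of one common way a Transformer layer can relate and fuse two modalities: single-head self-attention over the concatenation of RGB and thermal tokens. It is not the authors' implementation. The function name `joint_self_attention`, the token shapes, and the random projection matrices are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(rgb_tokens, tir_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated RGB and thermal tokens.

    Hypothetical illustration only; NOT the MMTE described in the paper.
    rgb_tokens, tir_tokens: (n, d) arrays of per-location features.
    w_q, w_k, w_v: (d, d) projection matrices.
    Returns the (2n, d) attended tokens and the (2n, 2n) attention map.
    """
    # Stack both modalities so every token can attend to every other token,
    # within its own modality and across modalities.
    tokens = np.concatenate([rgb_tokens, tir_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ v, attn


# Toy inputs: 4 tokens per modality, 8 feature channels (assumed sizes).
rng = np.random.default_rng(0)
n, d = 4, 8
rgb = rng.standard_normal((n, d))
tir = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = joint_self_attention(rgb, tir, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch, the off-diagonal blocks of `attn` (RGB rows against thermal columns, and vice versa) are where cross-modal relationships would show up. A real tracker would add multiple heads, learned weights, positional encodings, and a mechanism for carrying information across frames, none of which are modelled here.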

Source journal

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Self-citation rate: 0.00%

Latest articles from this journal

Text-Driven Relation Manipulation of Diffusion Imagery. A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras. Multi-Label Image Classification via Contrastive Co-occurrence Learning. Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning. Spectral State Fusion Tree Mamba for Hyperspectral Image Classification.