SiamTFA: Siamese Triple-Stream Feature Aggregation Network for Efficient RGBT Tracking
Authors: Jianming Zhang; Yu Qin; Shimeng Fan; Zhu Xiao; Jin Zhang
Journal: IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 2, pp. 1900-1913
Publication date: 2024-12-17
Publication type: Journal Article
DOI: 10.1109/TITS.2024.3512551
URL: https://ieeexplore.ieee.org/document/10804856/
Impact factor: 7.9 (JCR Q1, Engineering, Civil)
Citations: 0
Abstract
RGBT tracking uses images from the visible (RGB) and thermal infrared (TIR) modalities to continuously locate a target, and it plays an important role in many fields, including intelligent transportation systems. Most existing RGBT trackers do not achieve high precision and real-time tracking speed at the same time. To address this challenge, we propose an RGBT tracker, the Siamese Triple-stream Feature Aggregation Network (SiamTFA). First, we present a triple-stream backbone for multi-modal feature extraction and fusion. It contains two parallel Swin Transformer feature extraction streams and one feature fusion stream composed of joint-complementary feature aggregation (JCFA) modules. Second, the JCFA module uses joint-complementary attention to guide the aggregation of multi-modal features. Specifically, the joint attention combines the features of the two modalities to focus on the spatial location and semantic information of the target. Because the RGB and TIR modalities are complementary, the complementary attention enhances information from the beneficial modality and suppresses information from the ineffective one. Third, to reduce the computational complexity of the joint-complementary attention, we propose a depthwise shared attention structure, which uses depthwise convolution and shared features to make the attention lightweight. Finally, we conduct extensive experiments on four official RGBT test datasets. The results show that the proposed tracker outperforms some state-of-the-art trackers while reaching a tracking speed of 37 frames per second (FPS). The code is available at https://github.com/zjjqinyu/SiamTFA.
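The abstract does not give the formulas behind the complementary attention or the depthwise shared attention, so the following is only an illustrative sketch and not the authors' implementation. It assumes a simple stand-in for complementary weighting (a per-channel softmax across the two modalities, computed from globally averaged features) and compares the weight count of a standard convolution with that of a depthwise convolution. All function names (`global_avg_pool`, `complementary_weights`, `fuse`, `conv_params`) are hypothetical.

```python
import math


def global_avg_pool(feat):
    """Reduce each channel (an H x W list of lists) to its mean value."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def complementary_weights(rgb_desc, tir_desc):
    """Assumed stand-in: per-channel softmax across the two modalities.

    The stronger modality gets the larger weight, and the two weights for
    each channel sum to 1, so boosting one modality suppresses the other.
    """
    weights = []
    for r, t in zip(rgb_desc, tir_desc):
        m = max(r, t)  # subtract the max for numerical stability
        e_r, e_t = math.exp(r - m), math.exp(t - m)
        weights.append((e_r / (e_r + e_t), e_t / (e_r + e_t)))
    return weights


def fuse(rgb, tir):
    """Blend RGB and TIR feature maps channel by channel with the weights."""
    weights = complementary_weights(global_avg_pool(rgb), global_avg_pool(tir))
    return [
        [[w_r * a + w_t * b for a, b in zip(row_r, row_t)]
         for row_r, row_t in zip(ch_r, ch_t)]
        for (w_r, w_t), ch_r, ch_t in zip(weights, rgb, tir)
    ]


def conv_params(channels, kernel, depthwise):
    """Weight count (no bias) of a conv layer with equal in/out channels.

    A standard convolution mixes all channels (C * C * k * k weights);
    a depthwise convolution filters each channel separately (C * k * k).
    """
    per_filter = kernel * kernel
    if depthwise:
        return channels * per_filter
    return channels * channels * per_filter


if __name__ == "__main__":
    # Toy 1-channel, 2x2 feature maps: RGB responds strongly, TIR does not.
    rgb = [[[2.0, 2.0], [2.0, 2.0]]]
    tir = [[[0.0, 0.0], [0.0, 0.0]]]
    print(complementary_weights(global_avg_pool(rgb), global_avg_pool(tir)))
    print(fuse(rgb, tir))
    # Hypothetical layer size chosen only for illustration.
    print(conv_params(96, 3, depthwise=False), conv_params(96, 3, depthwise=True))
```

For the hypothetical layer with 96 channels and a 3x3 kernel, the standard convolution has 82944 weights and the depthwise one has 864, a factor equal to the channel count. This is the kind of saving that makes depthwise convolution attractive for lightweight attention, although the exact structure used in SiamTFA may differ.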
About the journal:
The journal covers the theoretical, experimental, and operational aspects of electrical and electronics engineering and information technologies as applied to Intelligent Transportation Systems (ITS). Intelligent Transportation Systems are defined as systems that use synergistic technologies and systems engineering concepts to develop and improve transportation systems of all kinds. The scope of this interdisciplinary activity includes promoting, consolidating, and coordinating ITS technical activities among IEEE entities, and providing a focus for cooperative activities, both internal and external.