RGBT tracking via frequency-aware feature enhancement and unidirectional mixed attention

Jianming Zhang, Jing Yang, Zikang Liu, Jin Wang

Neurocomputing, Volume 616, Article 128908, published 2024-11-19. DOI: 10.1016/j.neucom.2024.128908
Citations: 0
Abstract
RGBT object tracking is widely used due to the complementary nature of the RGB and TIR modalities. However, RGBT trackers based on Transformers or CNNs face significant challenges in effectively enhancing and extracting features from one modality and fusing them into the other. To achieve effective regional feature representation and adequate information fusion, we propose a novel tracking method that employs frequency-aware feature enhancement and bidirectional multistage feature fusion. Firstly, we propose an Early Region Feature Enhancement (ERFE) module, which comprises a Frequency-aware Self-region Feature Enhancement (FSFE) block and a Cross-attention Cross-region Feature Enhancement (CCFE) block. The FFT-based FSFE block enhances the features of the template or the search region separately, while the CCFE block improves feature representation by considering the template and search region jointly. Secondly, we propose a Bidirectional Multistage Feature Fusion (BMFF) module, with the Complementary Feature Extraction Attention (CFEA) module as its core component. The CFEA module, which includes the Unidirectional Mixed Attention (UMA) block and the Context Focused Attention (CFA) block, extracts information from one modality. When RGB is the primary modality, TIR is the auxiliary modality, and vice versa. The auxiliary-modality features processed by CFEA are added to the primary-modality features; this information fusion process is bidirectional and multistage. Thirdly, extensive experiments on three benchmark datasets (RGBT234, LaSHeR, and GTOT) demonstrate that our tracker outperforms advanced RGBT tracking methods.
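The abstract does not give the internals of the FFT-based FSFE block, but the general pattern of frequency-aware enhancement can be sketched as follows: transform a region's feature map to the frequency domain, reweight its frequency components with a filter (learned in practice; a fixed array here), and transform back with a residual connection. All names and the filter shape are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def frequency_aware_enhance(feat, gain):
    """Sketch of an FFT-based region feature enhancement step.

    feat: (H, W, C) real-valued feature map of a template or search region.
    gain: (H, W) filter applied in the frequency domain (hypothetical;
          in the paper this would be learned, details are not in the abstract).
    """
    spec = np.fft.fft2(feat, axes=(0, 1))       # per-channel 2-D FFT
    spec = spec * gain[..., None]               # reweight frequency components
    out = np.fft.ifft2(spec, axes=(0, 1)).real  # back to the spatial domain
    return feat + out                           # residual enhancement

# With an all-ones gain the filter is the identity, so the residual
# connection simply doubles the input features.
feat = np.random.rand(8, 8, 4)
out = frequency_aware_enhance(feat, np.ones((8, 8)))
assert np.allclose(out, 2 * feat)
```

A low-pass or band-emphasizing `gain` would suppress noisy high-frequency components, which is one plausible reason to enhance template and search-region features in the frequency domain.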
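The bidirectional fusion described above (each modality acting once as primary, with the other modality's CFEA-processed features added to it) can be illustrated with a bare cross-attention step. This is a minimal sketch assuming single-head attention with no learned projections and omitting the CFA block; it shows only the direction-swapping structure, not the paper's actual CFEA design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(primary, auxiliary):
    """One fusion direction: queries from the primary modality attend over
    auxiliary-modality tokens, and the result is added back to the primary
    features (hypothetical simplification of the CFEA module).

    primary, auxiliary: (N, d) token matrices.
    """
    d = primary.shape[-1]
    attn = softmax(primary @ auxiliary.T / np.sqrt(d))  # (N, N) weights
    return primary + attn @ auxiliary                   # residual fusion

def bidirectional_fuse(rgb, tir):
    # Each modality serves as primary once, with the other as auxiliary.
    return cross_attend(rgb, tir), cross_attend(tir, rgb)

rgb = np.random.rand(16, 32)
tir = np.random.rand(16, 32)
rgb_fused, tir_fused = bidirectional_fuse(rgb, tir)
assert rgb_fused.shape == rgb.shape and tir_fused.shape == tir.shape
```

In a multistage variant, this fusion would be repeated at several backbone depths, with each stage consuming the previous stage's fused features.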
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.