{"title":"Attention-Based Gating Network for Robust Segmentation Tracking","authors":"Yijin Yang;Xiaodong Gu","doi":"10.1109/TCSVT.2024.3460400","DOIUrl":null,"url":null,"abstract":"Visual object tracking is a challenging task that aims to accurately estimate the scale and position of a designated target. Recently, segmentation networks have proven effective in visual tracking, producing outstanding results for target scale estimation. However, segmentation-based trackers still lack robustness due to the presence of similar distractors. To mitigate this issue, we propose an Attention-based Gating Network (AGNet) that produces gating weights to diminish the impact of feature maps linked to similar distractors. Subsequently, we incorporate the AGNet into the segmentation-based tracking paradigm to achieve accurate and robust tracking. Specifically, the AGNet utilizes three cascading Multi-Head Cross-Attention (MHCA) modules to generate gating weights that govern the generation of feature maps in the baseline tracker. The proficiency of the MHCA in modeling global semantic information effectively suppresses feature maps associated with similar distractors. Additionally, we introduce a distractor-aware training strategy that leverages distractor masks to train our model. To alleviate the issue of partial occlusion, we introduce a box refinement module to enhance the accuracy of the predicted target box. Comprehensive experiments conducted on 11 challenging tracking benchmarks show that our approach significantly surpasses the baseline tracker across all metrics and achieves excellent results on multiple tracking benchmarks.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"245-258"},"PeriodicalIF":11.1000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10680064/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Visual object tracking is a challenging task that aims to accurately estimate the scale and position of a designated target. Recently, segmentation networks have proven effective in visual tracking, producing outstanding results for target scale estimation. However, segmentation-based trackers still lack robustness in the presence of similar distractors. To mitigate this issue, we propose an Attention-based Gating Network (AGNet) that produces gating weights to diminish the impact of feature maps linked to similar distractors. We then incorporate the AGNet into the segmentation-based tracking paradigm to achieve accurate and robust tracking. Specifically, the AGNet uses three cascaded Multi-Head Cross-Attention (MHCA) modules to generate gating weights that govern the generation of feature maps in the baseline tracker. The MHCA's ability to model global semantic information allows it to effectively suppress feature maps associated with similar distractors. Additionally, we introduce a distractor-aware training strategy that leverages distractor masks to train our model. To alleviate the issue of partial occlusion, we introduce a box refinement module that enhances the accuracy of the predicted target box. Comprehensive experiments on 11 challenging tracking benchmarks show that our approach significantly surpasses the baseline tracker across all metrics and achieves excellent results on multiple benchmarks.
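To make the gating idea concrete, below is a minimal PyTorch sketch of an attention-based gating module built from cascaded multi-head cross-attention blocks. It is an illustration under assumed names and shapes (the class name AttentionGating, dim=256, 8 heads, a depth of 3, and the token layouts are all assumptions), not the paper's actual AGNet: search-region tokens query the template tokens, and the attended output is passed through a sigmoid to produce gating weights that re-scale the search features, suppressing channels linked to distractors.

```python
import torch
import torch.nn as nn

class AttentionGating(nn.Module):
    """Minimal sketch: cascaded multi-head cross-attention blocks produce
    gating weights in (0, 1) that re-scale search-region features.
    Names, depth, and shapes are illustrative assumptions, not the paper's AGNet."""

    def __init__(self, dim=256, heads=8, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.to_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, search_feat, template_feat):
        # search_feat:   (B, N_s, C) tokens from the search region
        # template_feat: (B, N_t, C) tokens from the target template
        x = search_feat
        for attn, norm in zip(self.blocks, self.norms):
            # cross-attention: search tokens attend to template tokens
            out, _ = attn(query=x, key=template_feat, value=template_feat)
            x = norm(x + out)              # residual connection + layer norm
        gate = self.to_gate(x)             # gating weights in (0, 1)
        return search_feat * gate          # suppress distractor-related features

# toy usage
gating = AttentionGating(dim=256)
search = torch.randn(2, 32 * 32, 256)      # search-region tokens
template = torch.randn(2, 16 * 16, 256)    # template tokens
gated = gating(search, template)           # -> (2, 1024, 256)
```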
Journal Introduction
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.