{"title":"Overlapped Trajectory-Enhanced Visual Tracking","authors":"Li Shen;Xuyi Fan;Hongguang Li","doi":"10.1109/TCSVT.2024.3440330","DOIUrl":null,"url":null,"abstract":"Deep-learning-based methods have achieved promising performance in visual tracking tasks. However, the backbones of the existing trackers normally emanate from the object detection realm, making them inefficient and insufficient in terms of spatial template matching. Moreover, such trackers apply temporal information without considering its authenticity during the online inference step, rendering them prone to error accumulation. To address these two issues, this work proposes OTETrack, a novel visual tracker with overlapped feature extraction and robust trajectory enhancement. The backbone of OTETrack, termed Overlapped ViT, slices the input image into overlapped patches to attain stronger template matching capabilities and sends them to alternating attention modules to maintain high model efficiency. Moreover, the trajectory enhancement mechanism in OTETrack is used to predict the center of the ladder-shaped Hanning window, which mildly penalizes the displacements between the spatial tracking results and the temporal predicted results to maintain the tracking consistency of a video sequence, thus mitigating the influences of spurious temporal information. Extensive experiments conducted on five benchmarks with thirteen baselines demonstrate the state-of-the-art performance of OTETrack. The source code and Appendix are released on \n<uri>https://github.com/OrigamiSL/OTETrack</uri>\n.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"12949-12962"},"PeriodicalIF":11.1000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10630872","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10630872/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Deep-learning-based methods have achieved promising performance in visual tracking tasks. However, the backbones of existing trackers are typically inherited from the object detection domain, making them inefficient and ill-suited for spatial template matching. Moreover, such trackers apply temporal information during online inference without assessing its reliability, leaving them prone to error accumulation. To address these two issues, this work proposes OTETrack, a novel visual tracker with overlapped feature extraction and robust trajectory enhancement. The backbone of OTETrack, termed Overlapped ViT, slices the input image into overlapping patches to strengthen template matching and feeds them to alternating attention modules to maintain high model efficiency. In addition, the trajectory enhancement mechanism of OTETrack predicts the center of a ladder-shaped Hanning window, which mildly penalizes displacements between the spatial tracking results and the temporally predicted results, preserving tracking consistency across a video sequence and mitigating the influence of spurious temporal information. Extensive experiments on five benchmarks against thirteen baselines demonstrate the state-of-the-art performance of OTETrack. The source code and appendix are released at https://github.com/OrigamiSL/OTETrack.
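As a concrete illustration of the overlapped patch extraction described above, here is a minimal PyTorch sketch. It is not the authors' released implementation; the patch size of 16 and stride of 12 are hypothetical choices. The key idea is that setting the convolution stride smaller than the kernel size makes neighboring patches share pixels, in contrast to the non-overlapping patchify of a vanilla ViT.

```python
import torch
import torch.nn as nn


class OverlappedPatchEmbed(nn.Module):
    """Embed an image as a sequence of overlapping patch tokens.

    With stride < kernel_size, adjacent patches share pixels, unlike
    the non-overlapping patch split of a vanilla ViT.
    """

    def __init__(self, in_chans=3, embed_dim=768, patch_size=16, stride=12):
        super().__init__()
        # A single strided convolution both slices the overlapping
        # patches and projects each one to the embedding dimension.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride,
                              padding=patch_size // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence


if __name__ == "__main__":
    tokens = OverlappedPatchEmbed()(torch.randn(1, 3, 256, 256))
    print(tokens.shape)  # torch.Size([1, 484, 768])
```

Note that overlapping lengthens the token sequence (484 tokens here versus 256 for a non-overlapping 16x16 split of a 256x256 image), which is presumably why the paper pairs it with alternating attention modules to keep the model efficient.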
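The trajectory enhancement mechanism re-centers a penalty window at the temporally predicted position rather than at the previous spatial result. The ladder-shaped window is specific to the paper; the NumPy sketch below uses a standard 2-D Hanning window and a hypothetical blend weight purely to illustrate how a shifted window mildly penalizes displacements from the predicted center.

```python
import numpy as np


def shifted_hanning_penalty(score_map, center, weight=0.3):
    """Blend a tracker's score map with a 2-D Hanning window whose
    peak is moved to `center`, the trajectory-predicted position.

    score_map : (H, W) raw tracker response
    center    : (row, col) predicted target location on the map grid
    weight    : blend factor between raw scores and the window prior
    """
    h, w = score_map.shape
    # A double-size window whose peak sits near (h, w) ...
    window = np.outer(np.hanning(2 * h), np.hanning(2 * w))
    # ... cropped to an (h, w) view so the peak lands on `center`.
    r0 = h - center[0]
    c0 = w - center[1]
    shifted = window[r0:r0 + h, c0:c0 + w]
    return (1 - weight) * score_map + weight * shifted


# Usage: bias the response toward the trajectory-predicted center (10, 12).
scores = np.random.rand(16, 16)
penalized = shifted_hanning_penalty(scores, center=(10, 12))
print(np.unravel_index(penalized.argmax(), penalized.shape))
```

Because the blend is mild, a confident spatial response can still override the prior when the trajectory prediction is wrong, which matches the abstract's goal of discounting spurious temporal information rather than hard-gating it.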
About the Journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.