DPT-tracker: Dual pooling transformer for efficient visual tracking

Yang Fang, Bailian Xie, Uswah Khairuddin, Zijian Min, Bingbing Jiang, Weisheng Li

CAAI Transactions on Intelligence Technology, vol. 9, no. 4, pp. 948-959, published 2024-03-13.
DOI: 10.1049/cit2.12296 | Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12296
Citations: 0
Abstract
Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation through self- and/or cross-attention operations, so model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate the temporal dynamics and spatial appearance information of multi-frame templates along space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates these multi-scale features into a unified representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while maintaining a shorter attention-token sequence length and fewer parameters and FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
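The core efficiency argument (attention cost grows quadratically with the token sequence length, so pooling the tokens cuts computation) can be illustrated with a short, self-contained sketch. This is not the authors' DPT/SAM/MCPT/MAPT code; the module name PooledAttention, the average-pooling reduction, and the pool_ratio parameter are illustrative assumptions, showing only the general mechanism by which pooling the key/value tokens shortens the attention sequence and reduces FLOPs.

```python
# Minimal PyTorch sketch of pooling-based attention (assumed design, not the
# paper's implementation): keys/values are spatially pooled, so an L-token map
# is attended against roughly L / pool_ratio**2 tokens instead of L.
import torch
import torch.nn as nn


class PooledAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, pool_ratio: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Pooling the key/value map reduces attention cost from O(L^2)
        # towards O(L^2 / pool_ratio^2).
        self.pool = nn.AvgPool2d(kernel_size=pool_ratio, stride=pool_ratio)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, L, C) token sequence flattened from an h x w feature map.
        b, l, c = x.shape
        q = self.q(x).reshape(b, l, self.num_heads, self.head_dim).transpose(1, 2)

        # Pool the token map before computing keys and values.
        x_2d = x.transpose(1, 2).reshape(b, c, h, w)
        x_pooled = self.pool(x_2d).flatten(2).transpose(1, 2)  # (B, L', C), L' < L
        kv = self.kv(x_pooled).reshape(b, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, L', head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, L, L')
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, l, c)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(1, 16 * 16, 256)                  # 256 tokens from a 16x16 map
    print(PooledAttention(256)(x, 16, 16).shape)      # torch.Size([1, 256, 256])
```

With pool_ratio = 2, the 256 query tokens attend over only 64 pooled key/value tokens, which is the kind of sequence-length saving the abstract refers to when comparing DPT with full-attention Transformer trackers.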
Journal introduction:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), providing research which is openly accessible to read and share worldwide.