{"title":"基于变压器模型的两级眼窝视觉跟踪器","authors":"Guang Han;Jianshu Ma;Ziyang Li;Haitao Zhao","doi":"10.1109/TCDS.2024.3377642","DOIUrl":null,"url":null,"abstract":"With the development of transformer visual models, attention-based trackers have shown highly competitive performance in the field of object tracking. However, in some tracking scenarios, especially those with multiple similar objects, the performance of existing trackers is often not satisfactory. In order to improve the performance of trackers in such scenarios, inspired by the fovea vision structure and its visual characteristics, this article proposes a novel foveal vision tracker (FVT). FVT combines the process of human eye fixation and object tracking, pruning based on the distance to the object rather than attention scores. This pruning method allows the receptive field of the feature extraction network to focus on the object, excluding background interference. FVT divides the feature extraction network into two stages: local and global, and introduces the local recursive module (LRM) and the view elimination module (VEM). LRM is used to enhance foreground features in the local stage, while VEM generates circular fovea-like visual field masks in the global stage and prunes tokens outside the mask, guiding the model to focus attention on high-information regions of the object. Experimental results on multiple object tracking datasets demonstrate that the proposed FVT achieves stronger object discrimination capability in the feature extraction stage, improves tracking accuracy and robustness in complex scenes, and achieves a significant accuracy improvement with an area overlap (AO) of 72.6% on the generic object tracking (GOT)-10k dataset.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 4","pages":"1575-1588"},"PeriodicalIF":5.0000,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Two-Stage Foveal Vision Tracker Based on Transformer Model\",\"authors\":\"Guang Han;Jianshu Ma;Ziyang Li;Haitao Zhao\",\"doi\":\"10.1109/TCDS.2024.3377642\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the development of transformer visual models, attention-based trackers have shown highly competitive performance in the field of object tracking. However, in some tracking scenarios, especially those with multiple similar objects, the performance of existing trackers is often not satisfactory. In order to improve the performance of trackers in such scenarios, inspired by the fovea vision structure and its visual characteristics, this article proposes a novel foveal vision tracker (FVT). FVT combines the process of human eye fixation and object tracking, pruning based on the distance to the object rather than attention scores. This pruning method allows the receptive field of the feature extraction network to focus on the object, excluding background interference. FVT divides the feature extraction network into two stages: local and global, and introduces the local recursive module (LRM) and the view elimination module (VEM). LRM is used to enhance foreground features in the local stage, while VEM generates circular fovea-like visual field masks in the global stage and prunes tokens outside the mask, guiding the model to focus attention on high-information regions of the object. Experimental results on multiple object tracking datasets demonstrate that the proposed FVT achieves stronger object discrimination capability in the feature extraction stage, improves tracking accuracy and robustness in complex scenes, and achieves a significant accuracy improvement with an area overlap (AO) of 72.6% on the generic object tracking (GOT)-10k dataset.\",\"PeriodicalId\":54300,\"journal\":{\"name\":\"IEEE Transactions on Cognitive and Developmental Systems\",\"volume\":\"16 4\",\"pages\":\"1575-1588\"},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cognitive and Developmental Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10472720/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cognitive and Developmental Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10472720/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Two-Stage Foveal Vision Tracker Based on Transformer Model
With the development of transformer visual models, attention-based trackers have shown highly competitive performance in the field of object tracking. However, in some tracking scenarios, especially those with multiple similar objects, the performance of existing trackers is often not satisfactory. In order to improve the performance of trackers in such scenarios, inspired by the fovea vision structure and its visual characteristics, this article proposes a novel foveal vision tracker (FVT). FVT combines the process of human eye fixation and object tracking, pruning based on the distance to the object rather than attention scores. This pruning method allows the receptive field of the feature extraction network to focus on the object, excluding background interference. FVT divides the feature extraction network into two stages: local and global, and introduces the local recursive module (LRM) and the view elimination module (VEM). LRM is used to enhance foreground features in the local stage, while VEM generates circular fovea-like visual field masks in the global stage and prunes tokens outside the mask, guiding the model to focus attention on high-information regions of the object. Experimental results on multiple object tracking datasets demonstrate that the proposed FVT achieves stronger object discrimination capability in the feature extraction stage, improves tracking accuracy and robustness in complex scenes, and achieves a significant accuracy improvement with an area overlap (AO) of 72.6% on the generic object tracking (GOT)-10k dataset.
期刊介绍:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.