{"title":"OffsetNet:实现高效的多目标跟踪、检测和分割","authors":"Wei Zhang, Jiaming Li, Meng Xia, Xu Gao, Xiao Tan, Yifeng Shi, Zhenhua Huang, Guanbin Li","doi":"10.1109/TPAMI.2024.3485644","DOIUrl":null,"url":null,"abstract":"<p><p>Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"OffsetNet: Towards Efficient Multiple Object Tracking, Detection, and Segmentation.\",\"authors\":\"Wei Zhang, Jiaming Li, Meng Xia, Xu Gao, Xiao Tan, Yifeng Shi, Zhenhua Huang, Guanbin Li\",\"doi\":\"10.1109/TPAMI.2024.3485644\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.</p>\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TPAMI.2024.3485644\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3485644","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
基于偏移量的表示法已成为一种很有前途的方法,可用于模拟像素与物体运动之间的语义关系,在各种计算机视觉任务中都显示出功效。在本文中,我们介绍了一种新颖的单级多任务网络,专门用于将基于偏移的方法扩展到 MOTS。我们提出的框架名为 OffsetNet,旨在同时处理模态边界框检测、实例分割和跟踪。为了实现这一目标,我们将这三项任务置于统一的基于像素偏移的表示法中,从而实现了出色的效率,并鼓励了相互协作。OffsetNet 实现了几个显著的特性:首先,编码器由一个新颖的内存增强线性自保持(MELSA)模块授权,以有效地聚合空间-时间特征;其次,所有任务都通过三个轻量级解码器进行解耦,以单次方式运行;第三,提出了一个新颖的跨帧偏移预测模块,以增强跟踪对遮挡的鲁棒性。凭借这些优点,OffsetNet 在 KITTI MOTS 基准测试中取得了 76.83% 的 HOTA,这是在不依赖 3D 检测的情况下取得的最佳成绩。此外,OffsetNet 在 KITTI MOT 基准测试中以 50 FPS 的速度实现了 74.83% 的 HOTA,比性能更好的 CenterTrack 快了近 3.3 倍。我们希望我们的方法能成为一个坚实的基线,并鼓励未来在这一领域的研究。
OffsetNet: Towards Efficient Multiple Object Tracking, Detection, and Segmentation.
Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.