Wei Feng;Feifan Wang;Ruize Han;Yiyang Gan;Zekun Qian;Junhui Hou;Song Wang
{"title":"Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking","authors":"Wei Feng;Feifan Wang;Ruize Han;Yiyang Gan;Zekun Qian;Junhui Hou;Song Wang","doi":"10.1109/TPAMI.2024.3463966","DOIUrl":null,"url":null,"abstract":"Multi-view multi-human association and tracking (MvMHAT), is an emerging yet important problem for multi-person scene video surveillance, aiming to track a group of people over time in each view, as well as to identify the same person across different views at the same time, which is different from previous MOT and multi-camera MOT tasks only considering the over-time human tracking. This way, the videos for MvMHAT require more complex annotations while containing more information for self-learning. In this work, we tackle this problem with an end-to-end neural network in a self-supervised learning manner. Specifically, we propose to take advantage of the spatial-temporal self-consistency rationale by considering three properties of reflexivity, symmetry, and transitivity. Besides the reflexivity property that naturally holds, we design the self-supervised learning losses based on the properties of symmetry and transitivity, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote the research on MvMHAT, we build two new large-scale benchmarks for the network training and testing of different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. 
We have released the benchmark and code to the public.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 1","pages":"351-368"},"PeriodicalIF":18.6000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10684138/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Multi-view multi-human association and tracking (MvMHAT) is an emerging yet important problem in multi-person video surveillance. It aims to track a group of people over time in each view and, simultaneously, to identify the same person across different views, which differs from previous MOT and multi-camera MOT tasks that consider only over-time human tracking. Consequently, videos for MvMHAT require more complex annotations while containing more information for self-learning. In this work, we tackle this problem with an end-to-end neural network trained in a self-supervised manner. Specifically, we propose to exploit the spatial-temporal self-consistency rationale through three properties: reflexivity, symmetry, and transitivity. Besides the reflexivity property, which holds naturally, we design self-supervised learning losses based on the symmetry and transitivity properties, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote research on MvMHAT, we build two new large-scale benchmarks for network training and for testing different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
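The symmetry and transitivity properties mentioned in the abstract can be made concrete with a small numerical sketch. Below is a hedged illustration (not the paper's actual implementation): `assignment_matrix` is a hypothetical stand-in for the learned matching head, producing a soft assignment between people detected in two views via a temperature-scaled softmax over cosine similarities. Symmetry then says that matching view *a* to view *b* and back should return each identity to itself (the product of the two assignment matrices should be near the identity), and transitivity says that chaining assignments *a*→*b*→*c* should agree with the direct assignment *a*→*c*. The loss definitions here are plausible mean-squared formulations of those constraints, assumed for illustration only.

```python
import numpy as np

def assignment_matrix(feat_a, feat_b, temperature=10.0):
    """Soft assignment from people in view a to people in view b.

    A simplified, hypothetical stand-in for the paper's learned matching:
    row-wise softmax over temperature-scaled cosine similarities.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T                       # pairwise cosine similarities
    e = np.exp(temperature * sim)
    return e / e.sum(axis=1, keepdims=True)

def symmetry_loss(S_ab, S_ba):
    """Matching a->b then b->a should map each identity to itself,
    i.e. S_ab @ S_ba should be close to the identity matrix."""
    n = S_ab.shape[0]
    return np.mean((S_ab @ S_ba - np.eye(n)) ** 2)

def transitivity_loss(S_ab, S_bc, S_ac):
    """Chained assignment a->b->c should agree with the direct a->c."""
    return np.mean((S_ab @ S_bc - S_ac) ** 2)

# Toy check: three "views" observing the same 4 people with mild feature noise.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 64))     # shared identity features
fa = base + 0.05 * rng.standard_normal((4, 64))
fb = base + 0.05 * rng.standard_normal((4, 64))
fc = base + 0.05 * rng.standard_normal((4, 64))

S_ab, S_ba = assignment_matrix(fa, fb), assignment_matrix(fb, fa)
S_bc, S_ac = assignment_matrix(fb, fc), assignment_matrix(fa, fc)

sym = symmetry_loss(S_ab, S_ba)         # near zero for consistent features
trans = transitivity_loss(S_ab, S_bc, S_ac)
```

In a self-supervised setting, losses of this form need no identity labels: the cycle- and chain-consistency constraints themselves supervise both the appearance features and the assignment matrices.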