Ye Li, Lei Wu, Yiping Chen, Xinzhong Wang, Guangqiang Yin, Zhiguo Wang
Title: Motion estimation and multi-stage association for tracking-by-detection
Journal: Complex & Intelligent Systems (JCR Q1, Computer Science, Artificial Intelligence; impact factor 5.0)
Published: 2023-11-22
DOI: 10.1007/s40747-023-01273-3 (https://doi.org/10.1007/s40747-023-01273-3)
Citations: 0
Abstract
Multi-object tracking (MOT) aims to locate and identify objects in videos. As deep learning has brought excellent performance to object detection, tracking-by-detection (TBD) has gradually become a mainstream tracking framework. However, the current TBD framework still has some drawbacks: (1) in the detection part, bounding boxes are predicted inaccurately because the actual pedestrian aspect ratio in surveillance scenes is overlooked; (2) in the motion prediction part, the width of the bounding box in the next frame may be predicted indirectly through the aspect ratio, which increases the width prediction error; (3) in the data association part, association is performed only for high-confidence detection boxes, and the low-confidence boxes caused by occlusion are discarded, resulting in fragmented trajectories. To address these issues, we propose a multi-target tracking model incorporating motion estimation and multi-stage association (MEMA). First, the aspect ratio of the ground-truth bounding box is introduced to improve the fit between the detection and the ground-truth bounding box, and we design an elliptical Gaussian kernel to improve the positioning accuracy of the object center point. Then, the state vector of the Kalman filter is modified to predict the width and its corresponding velocity directly. This reduces the width error of the predicted box and eliminates the velocity error of the motion estimation, which leads to a more pedestrian-friendly predicted bounding box. Finally, we propose a multi-stage association strategy to associate detection boxes of different confidence levels. Without using appearance features, the strategy reduces the impact of occlusion and improves tracking performance. On the MOT17 test set, the proposed method achieves a MOTA of 74.3% and an IDF1 of 72.4%, outperforming the current SOTA.
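To make the motion-estimation idea concrete, the following is a minimal sketch of a constant-velocity Kalman prediction step in which the box width and its velocity are state components, rather than being recovered from an aspect ratio. The state layout `[cx, cy, w, h, vcx, vcy, vw, vh]`, the function names, and the noise values are assumptions chosen for illustration; the abstract does not give the authors' exact formulation.

```python
import numpy as np


def make_transition(dt: float = 1.0) -> np.ndarray:
    """Constant-velocity transition matrix for the assumed state
    [cx, cy, w, h, vcx, vcy, vw, vh] (hypothetical layout)."""
    F = np.eye(8)
    # Each of cx, cy, w, h advances by dt times its own velocity.
    F[:4, 4:] = dt * np.eye(4)
    return F


def predict(x: np.ndarray, P: np.ndarray, F: np.ndarray, Q: np.ndarray):
    """Standard Kalman prediction step: propagate mean and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


# Illustrative values only: a 40x80 box at (100, 200) that moves right
# by 2 px/frame while its width grows and its height shrinks by 1 px/frame.
x = np.array([100.0, 200.0, 40.0, 80.0, 2.0, 0.0, 1.0, -1.0])
P = np.eye(8)          # assumed initial covariance
Q = 0.01 * np.eye(8)   # assumed process noise
F = make_transition(dt=1.0)

x_pred, P_pred = predict(x, P, F, Q)
print(x_pred[:4])  # predicted cx, cy, w, h
```

Because the width is a state variable here, its prediction depends only on the previous width and width velocity. With an aspect-ratio state, the width would instead be computed as the product of the predicted ratio and the predicted height, so errors in either quantity carry over into the width.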
About the journal:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.