An end-to-end tracking framework via multi-view and temporal feature aggregation

Impact factor: 4.3 · Region 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Computer Vision and Image Understanding · Pub Date: 2024-10-10 · DOI: 10.1016/j.cviu.2024.104203

Yihan Yang, Ming Xu, Jason F. Ralph, Yuchen Ling, Xiaonan Pan

Computer Vision and Image Understanding, volume 249, Article 104203 (journal article). https://www.sciencedirect.com/science/article/pii/S1077314224002844


Abstract

Multi-view pedestrian tracking is frequently used to cope with occlusion and limited fields of view in single-view tracking. However, few end-to-end methods exist in this field. Many existing algorithms detect pedestrians in individual views, cluster the projected detections in a top view, and then track them. Others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking that applies both multi-view and temporal aggregation of feature maps. The multi-view aggregation projects the per-view feature maps to a top view, uses a transformer encoder to output encoded feature maps, and then uses a CNN to calculate a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps in consecutive frames. Our experiments demonstrate that this end-to-end framework outperforms state-of-the-art online algorithms for multi-view pedestrian tracking.
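The abstract gives no implementation details, so the following is only a minimal sketch of the two aggregation ideas it names, not the authors' method. It assumes each camera is related to the ground plane by a 3x3 homography, uses single-channel feature maps, replaces the transformer encoder and both CNNs with plain averaging and externally supplied offsets, and assumes each offset points from a current detection back to its position in the previous frame. All function names and parameters (`project_to_top_view`, `aggregate_views`, `associate`, `max_dist`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Homography = Sequence[Sequence[float]]  # 3x3, maps ground (x, y, 1) to image (u, v, w)
FeatureMap = List[List[float]]          # single-channel map indexed [row][col]
Point = Tuple[float, float]


def project_to_top_view(feat: FeatureMap, H: Homography,
                        grid_h: int, grid_w: int) -> FeatureMap:
    """Sample a per-view feature map at the image location of every ground cell."""
    rows, cols = len(feat), len(feat[0])
    top = [[0.0] * grid_w for _ in range(grid_h)]
    for gy in range(grid_h):
        for gx in range(grid_w):
            u = H[0][0] * gx + H[0][1] * gy + H[0][2]
            v = H[1][0] * gx + H[1][1] * gy + H[1][2]
            w = H[2][0] * gx + H[2][1] * gy + H[2][2]
            if w <= 0:
                continue  # ground cell does not project in front of the camera
            col, row = round(u / w), round(v / w)  # nearest-neighbour sampling
            if 0 <= row < rows and 0 <= col < cols:
                top[gy][gx] = feat[row][col]
    return top


def aggregate_views(top_views: List[FeatureMap]) -> FeatureMap:
    """Average the projected maps (assumed stand-in for the learned encoder)."""
    n = len(top_views)
    height, width = len(top_views[0]), len(top_views[0][0])
    return [[sum(tv[y][x] for tv in top_views) / n for x in range(width)]
            for y in range(height)]


def associate(prev_tracks: Dict[int, Point], detections: List[Point],
              offsets: List[Point], max_dist: float = 2.0) -> Dict[int, Point]:
    """Greedy matching: shift each detection by its offset, take the nearest free track."""
    free = dict(prev_tracks)
    tracks: Dict[int, Point] = {}
    next_id = max(prev_tracks, default=-1) + 1
    for (x, y), (dx, dy) in zip(detections, offsets):
        px, py = x + dx, y + dy  # estimated position in the previous frame

        def dist2(track_id: int) -> float:
            tx, ty = free[track_id]
            return (tx - px) ** 2 + (ty - py) ** 2

        best = min(free, key=dist2, default=None)
        if best is not None and dist2(best) <= max_dist ** 2:
            del free[best]            # each previous track is matched at most once
            tracks[best] = (x, y)
        else:
            tracks[next_id] = (x, y)  # no track close enough: start a new identity
            next_id += 1
    return tracks
```

In this sketch, detections would come from peaks in the aggregated top-view map and offsets from a learned temporal model; here both are simply passed in as lists of ground-plane coordinates.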


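Source journal: Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days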

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems
