TS-BEV: BEV object detection algorithm based on temporal-spatial feature fusion
Xinlong Dong, Peicheng Shi, Heng Qi, Aixi Yang, Taonian Liang
Displays, Volume 84, Article 102814. Published 2024-08-19. DOI: 10.1016/j.displa.2024.102814
Citations: 0
Abstract
To accurately identify occluded targets and infer the motion state of objects, we propose a Bird's-Eye View object detection network based on temporal-spatial feature fusion (TS-BEV), which replaces the previous multi-frame sampling scheme with cyclic propagation of historical-frame instance information. We design a new temporal-spatial feature fusion attention module, which fully integrates temporal information with spatial features and improves both inference and training speed. To realize multi-frame feature fusion across multiple scales and views, we propose an efficient temporal-spatial deformable aggregation module, which performs feature sampling and weighted summation over multiple feature maps from historical and current frames, and exploits the parallel computing capabilities of GPUs and AI chips to further improve efficiency. Furthermore, to address the lack of global reasoning over temporal-spatial fused BEV features, where instance features at different locations cannot fully interact, we design a BEV self-attention module that operates on features globally, enhancing global inference and enabling full interaction among instance features. We have carried out extensive experiments on the challenging nuScenes BEV object detection dataset. Quantitative results show that our method achieves 61.5% mAP and 68.5% NDS on camera-only 3D object detection, and qualitative results show that TS-BEV effectively handles 3D object detection against complex traffic backgrounds with poor illumination at night, with good robustness and scalability.
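The deformable aggregation described above, sampling features at offset locations from several frames' BEV feature maps and combining them with learned weights, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all shapes, names, and the nearest-cell (integer) sampling are assumptions, and a real module would use bilinear sampling with learned, per-query offsets and weights.

```python
import numpy as np

def deformable_aggregate(feature_maps, points, offsets, weights):
    """Toy temporal-spatial deformable aggregation (illustrative only).

    feature_maps: (T, H, W, C) stacked per-frame BEV feature maps.
    points:       (N, 2) integer query locations (row, col).
    offsets:      (T, K, 2) integer sampling offsets per frame, K per frame.
    weights:      (T, K) aggregation weights (assumed to sum to 1).
    Returns (N, C): weighted sum of sampled features per query point.
    """
    T, H, W, C = feature_maps.shape
    N = points.shape[0]
    out = np.zeros((N, C))
    for t in range(T):                       # loop over historical + current frames
        for k in range(offsets.shape[1]):    # loop over sampling points per frame
            loc = points + offsets[t, k]     # shift queries by this offset
            r = np.clip(loc[:, 0], 0, H - 1) # keep samples inside the BEV grid
            c = np.clip(loc[:, 1], 0, W - 1)
            out += weights[t, k] * feature_maps[t, r, c]
    return out
```

With constant feature maps and weights that sum to one, the aggregation reproduces the constant, which is a quick sanity check that the weighting is normalised.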
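The BEV self-attention module is described as letting instance features at any BEV location interact globally. A plain scaled dot-product self-attention over the flattened BEV grid captures the core idea; the projection matrices, multi-head structure, and any sparsification the paper may use are omitted here, so treat this as a hedged sketch rather than the authors' design.

```python
import numpy as np

def bev_self_attention(bev):
    """Global self-attention over flattened BEV features (illustrative).

    bev: (H*W, C) flattened BEV feature grid.
    Every cell attends to every other cell via softmax-normalised
    scaled dot-product similarity, then re-aggregates the features.
    """
    C = bev.shape[1]
    scores = bev @ bev.T / np.sqrt(C)            # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ bev                            # globally mixed features
```

Because each output cell is a convex combination of all cells, identical inputs pass through unchanged, and the output always keeps the input's shape.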
Journal introduction:
Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface.
Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human-factors engineers new to the field, will also occasionally be featured.