Global-local Feature Aggregation for Event-based Object Detection on EventKITTI
Zichen Liang, Hu Cao, Chu Yang, Zikai Zhang, G. Chen
2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 20 September 2022. DOI: 10.1109/MFI55806.2022.9913852
Citations: 5
Abstract
Event sequences convey asynchronous pixel-wise visual information at low power and high temporal resolution, enabling more robust perception under challenging conditions, e.g., fast motion. Two main factors limit the development of event-based object detection in traffic scenes: the lack of high-quality datasets and the lack of effective event-based algorithms. To address the first problem, we propose a simulated event-based detection dataset named EventKITTI, which incorporates the novel event modality into a mixed two-level (i.e., object-level and video-level) detection dataset for traffic scenarios. EventKITTI offers high-quality event streams and the largest number of categories among existing datasets, at microsecond temporal resolution and 1242×375 spatial resolution. As for the second problem, existing algorithms rely on CNN-based, spiking, or graph architectures to capture local features of moving objects, leading to poor performance on objects with incomplete contours. Hence, we propose event-based object detectors named GFA-Net and CGFA-Net. To enhance global-local learning in the spatial dimension, GFA-Net introduces a transformer with edge-based position encoding and multi-scale feature fusion to detect objects on static frames. Furthermore, CGFA-Net refines the edge-based position encoding with closed-loop learning based on previously detected heatmaps, which aggregates temporal global features across event frames. The proposed event-based object detectors achieve the best speed-accuracy trade-off on EventKITTI, reaching 81.3% mAP at 33.0 FPS on the object-level detection dataset and 64.5% mAP at 30.3 FPS on the video-level detection dataset.
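The abstract does not specify how the asynchronous event stream is rasterized before being fed to the frame-based detectors, so the sketch below is only an illustration of the usual preprocessing step, not the authors' code: it accumulates events (x, y, timestamp, polarity) into a two-channel polarity histogram at EventKITTI's 1242×375 resolution. The function name `events_to_frame` and the assumed (x, y, t, p) column layout are hypothetical.

```python
# Minimal sketch (assumed representation, not from the paper): turn an
# asynchronous event stream into a fixed-size "event frame" that a
# frame-based detector such as GFA-Net could consume.
import numpy as np

WIDTH, HEIGHT = 1242, 375  # EventKITTI spatial resolution


def events_to_frame(events: np.ndarray) -> np.ndarray:
    """Accumulate events into a 2-channel event frame.

    events: (N, 4) array with columns x, y, timestamp (us), polarity (0/1).
    Returns a (2, HEIGHT, WIDTH) float32 frame: channel 0 counts positive
    (brightness-increase) events, channel 1 counts negative events.
    """
    frame = np.zeros((2, HEIGHT, WIDTH), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 3].astype(np.int64)  # 1 = ON event, 0 = OFF event
    # np.add.at performs an unbuffered add, so repeated (x, y)
    # coordinates are each counted instead of overwriting one another.
    np.add.at(frame, (1 - p, y, x), 1.0)
    return frame


# Usage example with synthetic events over a ~33 ms window (~30 FPS),
# roughly matching the frame rates reported in the abstract.
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, WIDTH, 10_000),
               rng.integers(0, HEIGHT, 10_000),
               rng.integers(0, 33_000, 10_000),
               rng.integers(0, 2, 10_000)], axis=1)
print(events_to_frame(ev).shape)  # (2, 375, 1242)
```

A fixed-window histogram like this discards exact timestamps; representations that keep more temporal detail (e.g., time surfaces or voxel grids) are common alternatives when microsecond timing matters.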