Global-local Feature Aggregation for Event-based Object Detection on EventKITTI

Zichen Liang, Hu Cao, Chu Yang, Zikai Zhang, G. Chen
{"title":"Global-local Feature Aggregation for Event-based Object Detection on EventKITTI","authors":"Zichen Liang, Hu Cao, Chu Yang, Zikai Zhang, G. Chen","doi":"10.1109/MFI55806.2022.9913852","DOIUrl":null,"url":null,"abstract":"Event sequence conveys asynchronous pixel-wise visual information in a low power and high temporal resolution manner, which enables more robust perception under challenging conditions, e.g., fast motion. Two main factors limit the development of event-based object detection in traffic scenes: lack of high-quality datasets and effective event-based algorithms. To solve the first problem, we propose a simulated event-based detection dataset named EventKITTI, which incorporates the novel event modality information into a mixed two-level (i.e. object-level and video-level) detection dataset under traffic scenarios. EventKITTI possesses the high-quality event stream and the largest number of categories at microsecond temporal resolution and 1242×375 spatial resolution, exceeding existing datasets. As for the second problem, existing algorithms rely on CNN-based, spiking or graph architectures to capture local features of moving objects, leading to poor performance in objects with incomplete contours. Hence, we propose event-based object detectors named GFA-Net and CGFA-Net. To enhance the global-local learning ability in the spatial dimension, GFA-Net introduces transformer with edge-based position encoding and multi-scale feature fusion to detect objects on static frame. Furthermore, CGFA-Net optimizes edge-based position encoding with close-loop learning based on previous detected heatmap, which aggregates temporal global features across event frames. The proposed event-based object detectors achieve the best speed-accuracy trade-off on EventKITTI, approaching an 81.3% MAP at 33.0 FPS on object-level detection dataset and a 64.5% MAP at 30.3 FPS on video-level detection dataset.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MFI55806.2022.9913852","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Event sequences convey asynchronous, pixel-wise visual information at low power and high temporal resolution, enabling more robust perception under challenging conditions such as fast motion. Two main factors limit the development of event-based object detection in traffic scenes: the lack of high-quality datasets and of effective event-based algorithms. To address the first problem, we propose a simulated event-based detection dataset named EventKITTI, which incorporates the novel event modality into a mixed two-level (i.e., object-level and video-level) detection dataset for traffic scenarios. EventKITTI offers high-quality event streams and the largest number of categories, at microsecond temporal resolution and 1242×375 spatial resolution, exceeding existing datasets. As for the second problem, existing algorithms rely on CNN-based, spiking, or graph architectures to capture local features of moving objects, which leads to poor performance on objects with incomplete contours. Hence, we propose event-based object detectors named GFA-Net and CGFA-Net. To enhance global-local learning in the spatial dimension, GFA-Net introduces a transformer with edge-based position encoding and multi-scale feature fusion to detect objects on static event frames. Furthermore, CGFA-Net refines the edge-based position encoding through closed-loop learning based on previously detected heatmaps, which aggregates temporally global features across event frames. The proposed detectors achieve the best speed-accuracy trade-off on EventKITTI, reaching 81.3% mAP at 33.0 FPS on the object-level detection dataset and 64.5% mAP at 30.3 FPS on the video-level detection dataset.
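
The abstract describes detecting objects on static event frames built from an asynchronous event stream at EventKITTI's 1242×375 resolution. As a minimal illustrative sketch (not the paper's code), the snippet below shows one common way to bin a stream of (x, y, t, polarity) events into a two-channel frame that a frame-based detector such as GFA-Net could consume; the window length, polarity split, and normalization here are assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical sketch: accumulate an asynchronous event stream into a
# polarity-separated event frame at EventKITTI resolution (1242x375).

WIDTH, HEIGHT = 1242, 375  # EventKITTI spatial resolution

def events_to_frame(xs, ys, ts, ps, t_start, t_end):
    """Bin events with timestamps in [t_start, t_end) into a 2-channel frame.

    xs, ys : integer pixel coordinates of each event
    ts     : microsecond timestamps
    ps     : polarities in {0, 1} (brightness decrease / increase)
    Returns a (2, HEIGHT, WIDTH) float32 array of per-pixel event counts.
    """
    frame = np.zeros((2, HEIGHT, WIDTH), dtype=np.float32)
    mask = (ts >= t_start) & (ts < t_end)
    # Accumulate event counts per pixel, one channel per polarity.
    np.add.at(frame, (ps[mask], ys[mask], xs[mask]), 1.0)
    # Normalize so windows with different event rates are comparable (assumed choice).
    if frame.max() > 0:
        frame /= frame.max()
    return frame

# Usage: slide a fixed-length window over the stream to obtain a frame sequence,
# e.g. frame = events_to_frame(xs, ys, ts, ps, t0, t0 + 33_000)  # ~33 ms window
```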