Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian
{"title":"Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms","authors":"Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian","doi":"arxiv-2408.09764","DOIUrl":null,"url":null,"abstract":"Human Action Recognition (HAR) stands as a pivotal research domain in both\ncomputer vision and artificial intelligence, with RGB cameras dominating as the\npreferred tool for investigation and innovation in this field. However, in\nreal-world applications, RGB cameras encounter numerous challenges, including\nlight conditions, fast motion, and privacy concerns. Consequently, bio-inspired\nevent cameras have garnered increasing attention due to their advantages of low\nenergy consumption, high dynamic range, etc. Nevertheless, most existing\nevent-based HAR datasets are low resolution ($346 \\times 260$). In this paper,\nwe propose a large-scale, high-definition ($1280 \\times 800$) human action\nrecognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It\nencompasses 150 commonly occurring action categories, comprising a total of\n124,625 video sequences. Various factors such as multi-view, illumination,\naction speed, and occlusion are considered when recording these data. To build\na more comprehensive benchmark dataset, we report over 20 mainstream HAR models\nfor future works to compare. In addition, we also propose a novel Mamba vision\nbackbone network for event stream based HAR, termed EVMamba, which equips the\nspatial plane multi-directional scanning and novel voxel temporal scanning\nmechanism. By encoding and mining the spatio-temporal information of event\nstreams, our EVMamba has achieved favorable results across multiple datasets.\nBoth the dataset and source code will be released on\n\\url{https://github.com/Event-AHU/CeleX-HAR}","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.09764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Human Action Recognition (HAR) is a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in the field. In real-world applications, however, RGB cameras face numerous challenges, including poor lighting conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have garnered increasing attention thanks to advantages such as low energy consumption and high dynamic range. Nevertheless, most existing event-based HAR datasets are low resolution ($346 \times 260$). In this paper, we propose a large-scale, high-definition ($1280 \times 800$) human action recognition dataset recorded with the CeleX-V event camera, termed CeleX-HAR. It encompasses 150 commonly occurring action categories and comprises a total of 124,625 video sequences. Factors such as multiple viewpoints, illumination, action speed, and occlusion were varied when recording the data. To build a more comprehensive benchmark, we report results for over 20 mainstream HAR models for future works to compare against. In addition, we propose a novel Mamba vision backbone network for event-stream-based HAR, termed EVMamba, which is equipped with multi-directional scanning over the spatial plane and a novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, EVMamba achieves favorable results across multiple datasets. Both the dataset and the source code will be released at \url{https://github.com/Event-AHU/CeleX-HAR}.
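To make the two mechanisms named in the abstract concrete, the sketch below gives a minimal, hypothetical illustration; it is not the authors' released implementation, and the helper names `events_to_voxel_grid` and `multi_directional_scan` are assumptions. The first function accumulates a raw event stream into a temporal voxel grid, the kind of representation a voxel temporal scan would consume; the second flattens a 2-D feature map into the four scan orders (row-major, column-major, and their reverses) that VMamba-style spatial blocks commonly process in parallel.

```python
import torch

def events_to_voxel_grid(events: torch.Tensor, num_bins: int,
                         height: int, width: int) -> torch.Tensor:
    """Accumulate an event stream of shape (N, 4), rows (t, x, y, polarity),
    into a (num_bins, height, width) voxel grid with bilinear temporal
    weighting. Hypothetical helper; the paper's representation may differ."""
    voxel = torch.zeros(num_bins, height, width)
    t, x, y, p = events[:, 0], events[:, 1].long(), events[:, 2].long(), events[:, 3]
    # Map timestamps to fractional bin coordinates in [0, num_bins - 1].
    t_frac = (t - t.min()) / (t.max() - t.min()).clamp(min=1e-9) * (num_bins - 1)
    lo = t_frac.floor().long().clamp(0, num_bins - 1)
    hi = (lo + 1).clamp(0, num_bins - 1)
    w_hi = t_frac - lo.float()
    pol = torch.where(p > 0, torch.ones_like(p), -torch.ones_like(p))
    # Split each event's polarity between its two nearest temporal bins.
    voxel.index_put_((lo, y, x), pol * (1.0 - w_hi), accumulate=True)
    voxel.index_put_((hi, y, x), pol * w_hi, accumulate=True)
    return voxel

def multi_directional_scan(feat: torch.Tensor) -> torch.Tensor:
    """Flatten a (C, H, W) feature map into four 1-D token sequences:
    row-major, column-major, and their reverses."""
    C, H, W = feat.shape
    rows = feat.reshape(C, H * W)                   # left-right, top-down
    cols = feat.permute(0, 2, 1).reshape(C, H * W)  # top-down, left-right
    return torch.stack([rows, rows.flip(-1), cols, cols.flip(-1)])  # (4, C, H*W)
```

In a full backbone of this style, each of the four sequences would pass through its own selective-scan (Mamba) layer before being merged back into the 2-D layout, and the temporal axis of the voxel grid would be scanned analogously across bins.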