Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian
arXiv:2408.09764 · arXiv - CS - Neural and Evolutionary Computing · Published: 2024-08-19
Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms
Human Action Recognition (HAR) stands as a pivotal research domain in both
computer vision and artificial intelligence, with RGB cameras dominating as the
preferred tool for investigation and innovation in this field. However, in
real-world applications, RGB cameras encounter numerous challenges, including
difficult lighting conditions, fast motion, and privacy concerns. Consequently,
bio-inspired event cameras have garnered increasing attention due to advantages
such as low energy consumption and high dynamic range. Nevertheless, most existing
event-based HAR datasets are low resolution ($346 \times 260$). In this paper,
we propose a large-scale, high-definition ($1280 \times 800$) human action
recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It
encompasses 150 commonly occurring action categories, comprising a total of
124,625 video sequences. Various factors such as multi-view, illumination,
action speed, and occlusion were considered when recording the data. To build
a more comprehensive benchmark, we report results for over 20 mainstream HAR
models for future work to compare against. In addition, we propose a novel
Mamba vision backbone network for event-stream-based HAR, termed EVMamba,
which is equipped with a spatial-plane multi-directional scanning mechanism
and a novel voxel temporal scanning mechanism. By encoding and mining the
spatio-temporal information of event streams, EVMamba achieves favorable
results across multiple datasets. Both the dataset and source code will be
released at \url{https://github.com/Event-AHU/CeleX-HAR}.
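The abstract describes encoding the spatio-temporal information of event streams before feeding them to a vision backbone. A common preprocessing step for event-based recognition is to accumulate the raw event stream (timestamp, x, y, polarity) into a spatio-temporal voxel grid. The sketch below illustrates that general idea only; the function name and the exact binning scheme are illustrative assumptions, not the paper's own voxel construction, whose details are in the released code.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a voxel grid of shape (num_bins, height, width).

    events: array of shape (N, 4) with columns (t, x, y, polarity),
    where polarity > 0 denotes an ON event and 0 (or negative) an OFF event.
    This is a generic encoding, not necessarily CeleX-HAR/EVMamba's exact one.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0].astype(np.float64)
    x = events[:, 1].astype(np.int64)
    y = events[:, 2].astype(np.int64)
    # Map polarity to +1 / -1 so ON and OFF events accumulate with sign.
    p = np.where(events[:, 3] > 0, 1.0, -1.0).astype(np.float32)
    # Normalize timestamps to [0, 1], then assign each event a temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    b = np.clip((t_norm * num_bins).astype(np.int64), 0, num_bins - 1)
    # Unbuffered accumulation: repeated (b, y, x) indices sum correctly.
    np.add.at(voxel, (b, y, x), p)
    return voxel
```

The resulting `(num_bins, H, W)` tensor can then be treated like a multi-channel image by a vision backbone, which is what makes frame-based architectures applicable to event data.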