{"title":"重姿态授权RGB网视频动作识别","authors":"Song Ren, Meng Ding","doi":"10.1109/ICCECE58074.2023.10135328","DOIUrl":null,"url":null,"abstract":"Recently, works related to video action recognition focus on using hybrid streams as input to get better results. Those streams usually are combinations of RGB channel with one additional feature stream such as audio, optical flow and pose information. Among those extra streams, posture as unstructured data is more difficult to fuse with RGB channel than the others. In this paper, we propose our Heavy Pose Empowered RGB Nets (HPER-Nets) ‐‐an end-to-end multitasking model‐‐based on the thorough investigation on how to fuse posture and RGB information. Given video frames as the only input, our model will reinforce it by merging the intrinsic posture information in the form of part affinity fields (PAFs), and use this hybrid stream to perform further video action recognition. Experimental results show that our model can outperform other different methods on UCF-101, UMDB and Kinetics datasets, and with only 16 frames, a 95.3% Top-1 accuracy on UCF101, a 69.6% on HMDB and a 41.0% on Kinetics have been recorded.","PeriodicalId":120030,"journal":{"name":"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Heavy Pose Empowered RGB Nets for Video Action Recognition\",\"authors\":\"Song Ren, Meng Ding\",\"doi\":\"10.1109/ICCECE58074.2023.10135328\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, works related to video action recognition focus on using hybrid streams as input to get better results. Those streams usually are combinations of RGB channel with one additional feature stream such as audio, optical flow and pose information. Among those extra streams, posture as unstructured data is more difficult to fuse with RGB channel than the others. In this paper, we propose our Heavy Pose Empowered RGB Nets (HPER-Nets) ‐‐an end-to-end multitasking model‐‐based on the thorough investigation on how to fuse posture and RGB information. Given video frames as the only input, our model will reinforce it by merging the intrinsic posture information in the form of part affinity fields (PAFs), and use this hybrid stream to perform further video action recognition. 
Experimental results show that our model can outperform other different methods on UCF-101, UMDB and Kinetics datasets, and with only 16 frames, a 95.3% Top-1 accuracy on UCF101, a 69.6% on HMDB and a 41.0% on Kinetics have been recorded.\",\"PeriodicalId\":120030,\"journal\":{\"name\":\"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCECE58074.2023.10135328\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCECE58074.2023.10135328","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Recent work on video action recognition has focused on using hybrid streams as input to obtain better results. These streams are usually combinations of the RGB channel with one additional feature stream such as audio, optical flow, or pose information. Among these extra streams, posture, being unstructured data, is harder to fuse with the RGB channel than the others. In this paper, we propose Heavy Pose Empowered RGB Nets (HPER-Nets), an end-to-end multitask model built on a thorough investigation of how to fuse posture and RGB information. Given video frames as the only input, our model reinforces them by merging the intrinsic posture information in the form of part affinity fields (PAFs), and uses this hybrid stream to perform video action recognition. Experimental results show that our model outperforms other methods on the UCF-101, HMDB and Kinetics datasets: with only 16 frames, it reaches 95.3% Top-1 accuracy on UCF-101, 69.6% on HMDB and 41.0% on Kinetics.
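
The abstract does not give architectural details, so the sketch below is only a rough PyTorch illustration of the general idea it describes, written under assumptions of my own: an auxiliary head estimates part affinity fields (PAFs) from the RGB frames, the PAF maps are concatenated with the frames channel-wise to form the hybrid stream, and a small 3D CNN classifies it. The module names (PAFHead, HybridActionNet), channel counts, and layer choices are hypothetical and are not taken from the paper.

# Minimal sketch (assumptions, not the authors' HPER-Net architecture):
# an auxiliary head predicts PAF maps from the RGB frames, the PAFs are
# concatenated with the frames along the channel axis, and a small 3D CNN
# classifies the resulting hybrid stream.
import torch
import torch.nn as nn


class PAFHead(nn.Module):
    """Per-frame PAF estimator (a stand-in for a real pose backbone)."""

    def __init__(self, num_limbs: int = 19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 * num_limbs, kernel_size=1),  # (x, y) vector field per limb
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> PAFs: (B, T, 2*num_limbs, H, W)
        b, t, c, h, w = frames.shape
        pafs = self.net(frames.reshape(b * t, c, h, w))
        return pafs.reshape(b, t, -1, h, w)


class HybridActionNet(nn.Module):
    """RGB + PAF hybrid stream fed to a 3D CNN action classifier."""

    def __init__(self, num_classes: int = 101, num_limbs: int = 19):
        super().__init__()
        self.paf_head = PAFHead(num_limbs)
        in_ch = 3 + 2 * num_limbs  # RGB channels + PAF channels
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W); video frames are the only input.
        pafs = self.paf_head(frames)
        hybrid = torch.cat([frames, pafs], dim=2)   # fuse along the channel axis
        hybrid = hybrid.permute(0, 2, 1, 3, 4)      # (B, C, T, H, W) for Conv3d
        feats = self.backbone(hybrid).flatten(1)
        return self.classifier(feats)


if __name__ == "__main__":
    clip = torch.randn(2, 16, 3, 112, 112)  # 16 frames, as in the reported results
    logits = HybridActionNet(num_classes=101)(clip)
    print(logits.shape)  # torch.Size([2, 101])

In the multitask setting the abstract implies, the PAF head would presumably also be supervised with a pose-field loss alongside the classification loss; that joint objective is omitted here for brevity.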