Human Action Recognition in First Person Videos using Verb-Object Pairs

2019 27th Signal Processing and Communications Applications Conference (SIU). Pub Date: 2019-04-01. DOI: 10.1109/SIU.2019.8806562

Zeynep Gökce, Selen Pehlivan

Citations: 2

Abstract

Human action recognition is important for distinguishing the rich variety of human activities in first-person videos. Although egocentric action recognition has improved, the space of action categories is large, and labeling training data for every category appears impractical. In this work, we decompose action models into pairs of verb and noun models and propose a method that combines them with a simple fusion strategy. Specifically, we use a 3-dimensional convolutional neural network model, C3D, for the verb stream to model video-based features, and an object detection model, YOLO, for the noun stream to model the objects the person interacts with. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset, which has 106 action classes, and show that our model performs comparably to state-of-the-art action recognition models.
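
The verb-noun decomposition can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the product of verb and noun probabilities used here is an assumption, as are the function names, the toy action vocabulary, and the example scores; in the paper the verb scores would come from C3D and the noun scores from YOLO.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (verb, noun)

# Hypothetical action vocabulary; the real dataset has 106 action classes.
ACTIONS: List[Action] = [
    ("cut", "tomato"),
    ("cut", "cucumber"),
    ("wash", "tomato"),
    ("open", "fridge"),
]


def fuse_scores(
    verb_probs: Dict[str, float],
    noun_probs: Dict[str, float],
    actions: List[Action] = ACTIONS,
) -> Dict[Action, float]:
    """Score each valid (verb, noun) action by the product of its parts.

    Assumption: a product rule stands in for the paper's unspecified
    "simple fusion strategy". Scores are renormalized over valid actions.
    """
    raw = {
        (verb, noun): verb_probs.get(verb, 0.0) * noun_probs.get(noun, 0.0)
        for verb, noun in actions
    }
    total = sum(raw.values())
    if total == 0.0:
        # No evidence for any valid pair: fall back to a uniform distribution.
        return {action: 1.0 / len(raw) for action in raw}
    return {action: score / total for action, score in raw.items()}


def predict_action(
    verb_probs: Dict[str, float],
    noun_probs: Dict[str, float],
    actions: List[Action] = ACTIONS,
) -> Action:
    """Return the action with the highest fused score."""
    fused = fuse_scores(verb_probs, noun_probs, actions)
    return max(fused, key=fused.get)


# Made-up stream outputs for a single clip (illustrative values only).
verb_probs = {"cut": 0.7, "wash": 0.2, "open": 0.1}
noun_probs = {"tomato": 0.6, "cucumber": 0.3, "fridge": 0.1}
fused = fuse_scores(verb_probs, noun_probs)
best = predict_action(verb_probs, noun_probs)
```

Restricting the fused scores to valid pairs is what lets separate verb and noun models cover a large action space: each stream needs labels only for its own vocabulary, not for every combination.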
