视频中人类动作识别的多尺度分层码本方法

2012 Ninth Conference on Computer and Robot Vision Pub Date : 2012-05-28 DOI:10.1109/CRV.2012.32

M. J. Roshtkhari, M. Levine

{"title":"视频中人类动作识别的多尺度分层码本方法","authors":"M. J. Roshtkhari, M. Levine","doi":"10.1109/CRV.2012.32","DOIUrl":null,"url":null,"abstract":"This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds similar videos to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video, and then the likelihood of similarity between the query video and all STVs in the target video is measured using a probabilistic inference mechanism. This hierarchy is achieved by initially constructing a codebook of STVs, while considering the uncertainty in the codebook construction, which is always ignored in current versions of the BOV approach. At the second level of the hierarchy, a large contextual region containing many STVs (Ensemble of STVs) is considered in order to construct a probabilistic model of STVs and their spatio-temporal compositions. At the third level of the hierarchy a codebook is formed for the ensembles of STVs based on their contextual similarities. The latter are the proposed labels (code words) for the actions being exhibited in the video. Finally, at the highest level of the hierarchy, the salient labels for the actions are selected by analyzing the high level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II) and the results were superior to other approaches, especially in the cases of a single training example and cross-dataset action recognition.","PeriodicalId":372951,"journal":{"name":"2012 Ninth Conference on Computer and Robot Vision","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"A Multi-Scale Hierarchical Codebook Method for Human Action Recognition in Videos Using a Single Example\",\"authors\":\"M. J. Roshtkhari, M. Levine\",\"doi\":\"10.1109/CRV.2012.32\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds similar videos to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video, and then the likelihood of similarity between the query video and all STVs in the target video is measured using a probabilistic inference mechanism. This hierarchy is achieved by initially constructing a codebook of STVs, while considering the uncertainty in the codebook construction, which is always ignored in current versions of the BOV approach. At the second level of the hierarchy, a large contextual region containing many STVs (Ensemble of STVs) is considered in order to construct a probabilistic model of STVs and their spatio-temporal compositions. At the third level of the hierarchy a codebook is formed for the ensembles of STVs based on their contextual similarities. The latter are the proposed labels (code words) for the actions being exhibited in the video. Finally, at the highest level of the hierarchy, the salient labels for the actions are selected by analyzing the high level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II) and the results were superior to other approaches, especially in the cases of a single training example and cross-dataset action recognition.\",\"PeriodicalId\":372951,\"journal\":{\"name\":\"2012 Ninth Conference on Computer and Robot Vision\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-05-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 Ninth Conference on Computer and Robot Vision\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CRV.2012.32\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 Ninth Conference on Computer and Robot Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CRV.2012.32","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

提出了一种基于局部时空视频卷分层码本的动作匹配方法。给定一个活动示例作为查询视频，该方法在视频数据集中找到与查询相似的视频。它基于视频词包(BOV)表示，不需要关于动作、背景减除、运动估计或跟踪的先验知识。它对空间和时间尺度的变化以及一些变形也具有鲁棒性。分层算法为查询视频生成stv显著码字的紧凑子集，然后使用概率推理机制测量查询视频与目标视频中所有stv之间的相似似然。这种层次结构是通过首先构造stv的码本来实现的，同时考虑到码本构造中的不确定性，这在当前版本的BOV方法中总是被忽略。在层次结构的第二层，考虑一个包含许多stv (stv集合)的大上下文区域，以构建stv及其时空组成的概率模型。在层次结构的第三层，基于上下文相似性为stv的集合形成一个码本。后者是建议的标签(码字)，用于在视频中展示的动作。最后，在层次结构的最高层，通过分析分配给每个图像像素的高级码字作为时间的函数来选择动作的显著标签。将该算法应用于三个不同复杂度的视频数据集(KTH, Weizmann和MSR II)进行动作识别，结果优于其他方法，特别是在单个训练样例和跨数据集动作识别的情况下。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Multi-Scale Hierarchical Codebook Method for Human Action Recognition in Videos Using a Single Example

This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds similar videos to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video, and then the likelihood of similarity between the query video and all STVs in the target video is measured using a probabilistic inference mechanism. This hierarchy is achieved by initially constructing a codebook of STVs, while considering the uncertainty in the codebook construction, which is always ignored in current versions of the BOV approach. At the second level of the hierarchy, a large contextual region containing many STVs (Ensemble of STVs) is considered in order to construct a probabilistic model of STVs and their spatio-temporal compositions. At the third level of the hierarchy a codebook is formed for the ensembles of STVs based on their contextual similarities. The latter are the proposed labels (code words) for the actions being exhibited in the video. Finally, at the highest level of the hierarchy, the salient labels for the actions are selected by analyzing the high level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II) and the results were superior to other approaches, especially in the cases of a single training example and cross-dataset action recognition.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 Ninth Conference on Computer and Robot Vision

自引率

0.00%

发文量

期刊最新文献

Visual Place Categorization in Indoor Environments Probabilistic Obstacle Detection Using 2 1/2 D Terrain Maps Shape from Suggestive Contours Using 3D Priors Large-Scale Tattoo Image Retrieval A Metaheuristic Bat-Inspired Algorithm for Full Body Human Pose Estimation