{"title":"Progressively Parsing Interactional Objects for Fine Grained Action Detection","authors":"Bingbing Ni, Xiaokang Yang, Shenghua Gao","doi":"10.1109/CVPR.2016.116","DOIUrl":null,"url":null,"abstract":"Fine grained video action analysis often requires reliable detection and tracking of various interacting objects and human body parts, denoted as Interactional Object Parsing. However, most of the previous methods based on either independent or joint object detection might suffer from high model complexity and challenging image content, e.g., illumination/pose/appearance/scale variation, motion, and occlusion etc. In this work, we propose an end-to-end system based on recurrent neural network to perform frame by frame interactional object parsing, which can alleviate the difficulty through an incremental/progressive manner. Our key innovation is that: instead of jointly outputting all object detections at once, for each frame we use a set of long-short term memory (LSTM) nodes to incrementally refine the detections. After passing through each LSTM node, more object detections are consolidated and thus more contextual information could be utilized to localize more difficult objects. The object parsing results are further utilized to form object specific action representation for fine grained action detection. Extensive experiments on two benchmark fine grained activity datasets demonstrate that our proposed algorithm achieves better interacting object detection performance, which in turn boosts the action recognition performance over the state-of-the-art.","PeriodicalId":6515,"journal":{"name":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"71 1","pages":"1020-1028"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"75","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2016.116","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 75
Abstract
Fine grained video action analysis often requires reliable detection and tracking of various interacting objects and human body parts, denoted as Interactional Object Parsing. However, most of the previous methods based on either independent or joint object detection might suffer from high model complexity and challenging image content, e.g., illumination/pose/appearance/scale variation, motion, and occlusion etc. In this work, we propose an end-to-end system based on recurrent neural network to perform frame by frame interactional object parsing, which can alleviate the difficulty through an incremental/progressive manner. Our key innovation is that: instead of jointly outputting all object detections at once, for each frame we use a set of long-short term memory (LSTM) nodes to incrementally refine the detections. After passing through each LSTM node, more object detections are consolidated and thus more contextual information could be utilized to localize more difficult objects. The object parsing results are further utilized to form object specific action representation for fine grained action detection. Extensive experiments on two benchmark fine grained activity datasets demonstrate that our proposed algorithm achieves better interacting object detection performance, which in turn boosts the action recognition performance over the state-of-the-art.