Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, Hannaneh Hajishirzi
We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images. We present the Textbook Question Answering (TQA) dataset, which includes 1,076 lessons and 26,260 multi-modal questions taken from middle school science curricula. Our analysis shows that a significant portion of the questions require complex parsing of the text and the diagrams, as well as reasoning over them, indicating that our dataset is more complex than previous machine comprehension and visual question answering datasets. We extend state-of-the-art methods for textual machine comprehension and visual question answering to the TQA dataset. Our experiments show that these models do not perform well on TQA. The presented dataset opens new challenges for research in question answering and reasoning across multiple modalities.
{"title":"Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension","authors":"Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, Hannaneh Hajishirzi","doi":"10.1109/CVPR.2017.571","DOIUrl":"https://doi.org/10.1109/CVPR.2017.571","url":null,"abstract":"We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images. We present the Textbook Question Answering (TQA) dataset that includes 1,076 lessons and 26,260 multi-modal questions, taken from middle school science curricula. Our analysis shows that a significant portion of questions require complex parsing of the text and the diagrams and reasoning, indicating that our dataset is more complex compared to previous machine comprehension and visual question answering datasets. We extend state-of-the-art methods for textual machine comprehension and visual question answering to the TQA dataset. Our experiments show that these models do not perform well on TQA. The presented dataset opens new challenges for research in question answering and reasoning across multiple modalities.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"57 1","pages":"5376-5384"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90843414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose an extremely simple but effective regularization technique for convolutional neural networks (CNNs), referred to as BranchOut, for online ensemble tracking. Our algorithm employs a CNN for target representation that has common convolutional layers but multiple branches of fully connected layers. For better regularization, a random subset of branches in the CNN is selected for online learning whenever the target appearance models need to be updated. Each branch may have a different number of layers to maintain variable abstraction levels of target appearances. BranchOut with multi-level target representation allows us to learn robust and diverse target appearance models and to handle various challenges in visual tracking effectively. The proposed algorithm is evaluated on standard tracking benchmarks and shows state-of-the-art performance even without additional pretraining on external tracking sequences.
{"title":"BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks","authors":"Bohyung Han, Jack Sim, Hartwig Adam","doi":"10.1109/CVPR.2017.63","DOIUrl":"https://doi.org/10.1109/CVPR.2017.63","url":null,"abstract":"We propose an extremely simple but effective regularization technique of convolutional neural networks (CNNs), referred to as BranchOut, for online ensemble tracking. Our algorithm employs a CNN for target representation, which has a common convolutional layers but has multiple branches of fully connected layers. For better regularization, a subset of branches in the CNN are selected randomly for online learning whenever target appearance models need to be updated. Each branch may have a different number of layers to maintain variable abstraction levels of target appearances. BranchOut with multi-level target representation allows us to learn robust target appearance models with diversity and handle various challenges in visual tracking problem effectively. The proposed algorithm is evaluated in standard tracking benchmarks and shows the state-of-the-art performance even without additional pretraining on external tracking sequences.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"189 1","pages":"521-530"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83055338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiming Luo, A. Mishra, Andrew Achkar, Justin A. Eichel, Shaozi Li, Pierre-Marc Jodoin
Saliency detection aims to highlight the most relevant objects in an image. Methods using conventional models struggle whenever salient objects are pictured on top of a cluttered background, while deep neural nets suffer from excess complexity and slow evaluation speeds. In this paper, we propose a simplified convolutional neural network that combines local and global information through a multi-resolution 4×5 grid structure. Instead of enforcing spatial coherence with a CRF or superpixels, as is usually the case, we implement a loss function inspired by the Mumford-Shah functional that penalizes errors on the boundary. We train our model on the MSRA-B dataset and test it on six different saliency benchmark datasets. Results show that our method is on par with the state of the art while reducing computation time by a factor of 18 to 100, enabling near real-time, high-performance saliency detection.
{"title":"Non-local Deep Features for Salient Object Detection","authors":"Zhiming Luo, A. Mishra, Andrew Achkar, Justin A. Eichel, Shaozi Li, Pierre-Marc Jodoin","doi":"10.1109/CVPR.2017.698","DOIUrl":"https://doi.org/10.1109/CVPR.2017.698","url":null,"abstract":"Saliency detection aims to highlight the most relevant objects in an image. Methods using conventional models struggle whenever salient objects are pictured on top of a cluttered background while deep neural nets suffer from excess complexity and slow evaluation speeds. In this paper, we propose a simplified convolutional neural network which combines local and global information through a multi-resolution 4×5 grid structure. Instead of enforcing spacial coherence with a CRF or superpixels as is usually the case, we implemented a loss function inspired by the Mumford-Shah functional which penalizes errors on the boundary. We trained our model on the MSRA-B dataset, and tested it on six different saliency benchmark datasets. Results show that our method is on par with the state-of-the-art while reducing computation time by a factor of 18 to 100 times, enabling near real-time, high performance saliency detection.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"16 1","pages":"6593-6601"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82057473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fabian Caba Heilbron, Wayner Barrios, Victor Escorcia, Bernard Ghanem
Despite recent advances in large-scale video analysis, action detection remains one of the most challenging unsolved problems in computer vision. This is in part due to the large volume of data that must be analyzed to detect actions in videos. Existing approaches have mitigated the computational cost, but they still lack the rich high-level semantics that would help them localize actions quickly. In this paper, we introduce a Semantic Context Cascade (SCC) model that aims to detect actions in long video sequences. By embracing semantic priors associated with human activities, SCC produces high-quality class-specific action proposals and prunes unrelated activities in a cascade fashion. Experimental results on ActivityNet show that SCC achieves state-of-the-art performance for action detection while operating in real time.
{"title":"SCC: Semantic Context Cascade for Efficient Action Detection","authors":"Fabian Caba Heilbron, Wayner Barrios, Victor Escorcia, Bernard Ghanem","doi":"10.1109/CVPR.2017.338","DOIUrl":"https://doi.org/10.1109/CVPR.2017.338","url":null,"abstract":"Despite the recent advances in large-scale video analysis, action detection remains as one of the most challenging unsolved problems in computer vision. This snag is in part due to the large volume of data that needs to be analyzed to detect actions in videos. Existing approaches have mitigated the computational cost, but still, these methods lack rich high-level semantics that helps them to localize the actions quickly. In this paper, we introduce a Semantic Cascade Context (SCC) model that aims to detect action in long video sequences. By embracing semantic priors associated with human activities, SCC produces high-quality class-specific action proposals and prune unrelated activities in a cascade fashion. Experimental results in ActivityNet unveils that SCC achieves state-of-the-art performance for action detection while operating at real time.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"43 1","pages":"3175-3184"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80522790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current CNN-based object detectors need to be initialized from pre-trained ImageNet classification models, which is usually time-consuming. In this paper, we present a fully convolutional feature mimic framework to train very efficient CNN-based detectors that do not need ImageNet pre-training and achieve performance competitive with large, slow models. We add supervision from high-level features of the large network during training to help the small network better learn object representations. More specifically, we apply a mimicking method to features sampled from the entire feature map and use a transform layer to map features from the small network onto the same dimension as the large network. In training the small network, we optimize the similarity between features sampled from the same region on the feature maps of both networks. Extensive experiments are conducted on pedestrian and common object detection tasks using VGG, Inception and ResNet. On both Caltech and Pascal VOC, we show that the modified 2.5× accelerated Inception network achieves performance competitive with the full Inception network. Our faster model runs at 80 FPS for a large 1000×1500 input with only a minor degradation in performance on Caltech.
{"title":"Mimicking Very Efficient Network for Object Detection","authors":"Quanquan Li, Sheng Jin, Junjie Yan","doi":"10.1109/CVPR.2017.776","DOIUrl":"https://doi.org/10.1109/CVPR.2017.776","url":null,"abstract":"Current CNN based object detectors need initialization from pre-trained ImageNet classification models, which are usually time-consuming. In this paper, we present a fully convolutional feature mimic framework to train very efficient CNN based detectors, which do not need ImageNet pre-training and achieve competitive performance as the large and slow models. We add supervision from high-level features of the large networks in training to help the small network better learn object representation. More specifically, we conduct a mimic method for the features sampled from the entire feature map and use a transform layer to map features from the small network onto the same dimension of the large network. In training the small network, we optimize the similarity between features sampled from the same region on the feature maps of both networks. Extensive experiments are conducted on pedestrian and common object detection tasks using VGG, Inception and ResNet. On both Caltech and Pascal VOC, we show that the modified 2.5× accelerated Inception network achieves competitive performance as the full Inception Network. Our faster model runs at 80 FPS for a 1000×1500 large input with only a minor degradation of performance on Caltech.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"7 1","pages":"7341-7349"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80784058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speckle refers to the granular patterns that occur in ultrasound images due to wave interference. Speckle removal can greatly improve the visibility of the underlying structures in an ultrasound image and enhance subsequent post-processing. We present a novel framework for speckle removal based on low-rank non-local filtering. Our approach works by first computing a guidance image that assists in the selection of candidate patches for non-local filtering in the face of significant speckle. The candidate patches are further refined using a low-rank minimization estimated with a truncated weighted nuclear norm (TWNN) and structured sparsity. We show that the proposed filtering framework produces results that outperform state-of-the-art methods both qualitatively and quantitatively. The framework also yields better segmentation results when used to pre-process ultrasound images.
{"title":"A Non-local Low-Rank Framework for Ultrasound Speckle Reduction","authors":"Lei Zhu, Chi-Wing Fu, M. S. Brown, P. Heng","doi":"10.1109/CVPR.2017.60","DOIUrl":"https://doi.org/10.1109/CVPR.2017.60","url":null,"abstract":"Speckle refers to the granular patterns that occur in ultrasound images due to wave interference. Speckle removal can greatly improve the visibility of the underlying structures in an ultrasound image and enhance subsequent post processing. We present a novel framework for speckle removal based on low-rank non-local filtering. Our approach works by first computing a guidance image that assists in the selection of candidate patches for non-local filtering in the face of significant speckles. The candidate patches are further refined using a low-rank minimization estimated using a truncated weighted nuclear norm (TWNN) and structured sparsity. We show that the proposed filtering framework produces results that outperform state-of-the-art methods both qualitatively and quantitatively. This framework also provides better segmentation results when used for pre-processing ultrasound images.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"577 1","pages":"493-501"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76259459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jing Zhang, Yang Cao, Shuai Fang, Yu Kang, Changwen Chen
In this paper, we address the problem of haze removal from a single nighttime image, even in the presence of varicolored and non-uniform illumination. The core idea lies in a novel maximum reflectance prior. We first introduce a nighttime hazy imaging model, which includes a local ambient illumination term in both the direct attenuation and scattering terms. Then, we propose a simple but effective image prior, the maximum reflectance prior, to estimate the varying ambient illumination. The maximum reflectance prior is based on a key observation: for most daytime haze-free image patches, each color channel has very high intensity at some pixels. For a nighttime hazy image, the local maximum intensities in each color channel are mainly contributed by the ambient illumination. Therefore, we can directly estimate the ambient illumination and the transmission map, and consequently restore a high-quality haze-free image. Experimental results on various nighttime hazy images demonstrate the effectiveness of the proposed approach. In particular, our approach has the advantage of computational efficiency: it is 10-100 times faster than state-of-the-art methods.
{"title":"Fast Haze Removal for Nighttime Image Using Maximum Reflectance Prior","authors":"Jing Zhang, Yang Cao, Shuai Fang, Yu Kang, Changwen Chen","doi":"10.1109/CVPR.2017.742","DOIUrl":"https://doi.org/10.1109/CVPR.2017.742","url":null,"abstract":"In this paper, we address a haze removal problem from a single nighttime image, even in the presence of varicolored and non-uniform illumination. The core idea lies in a novel maximum reflectance prior. We first introduce the nighttime hazy imaging model, which includes a local ambient illumination item in both direct attenuation term and scattering term. Then, we propose a simple but effective image prior, maximum reflectance prior, to estimate the varying ambient illumination. The maximum reflectance prior is based on a key observation: for most daytime haze-free image patches, each color channel has very high intensity at some pixels. For the nighttime haze image, the local maximum intensities at each color channel are mainly contributed by the ambient illumination. Therefore, we can directly estimate the ambient illumination and transmission map, and consequently restore a high quality haze-free image. Experimental results on various nighttime hazy images demonstrate the effectiveness of the proposed approach. In particular, our approach has the advantage of computational efficiency, which is 10-100 times faster than state-of-the-art methods.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"45 1","pages":"7016-7024"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90721435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edge detection has made significant progress with the help of deep convolutional networks (ConvNets). ConvNet-based edge detectors have approached human-level performance on standard benchmarks. We provide a systematic study of the outputs of these detectors and show that they fail to accurately localize edges, which can be detrimental for tasks that require crisp edge inputs. In addition, we propose a novel refinement architecture to address the challenging problem of learning a crisp edge detector with a ConvNet. Our method leverages a top-down backward refinement pathway and progressively increases the resolution of feature maps to generate crisp edges. Our results achieve promising performance on BSDS500, surpassing human accuracy under standard criteria and largely outperforming state-of-the-art methods under stricter criteria. We further demonstrate the benefit of crisp edge maps for estimating optical flow and generating object proposals.
{"title":"Deep Crisp Boundaries","authors":"Yupei Wang, Xin Zhao, Kaiqi Huang","doi":"10.1109/CVPR.2017.187","DOIUrl":"https://doi.org/10.1109/CVPR.2017.187","url":null,"abstract":"Edge detection had made significant progress with the help of deep Convolutional Networks (ConvNet). ConvNet based edge detectors approached human level performance on standard benchmarks. We provide a systematical study of these detector outputs, and show that they failed to accurately localize edges, which can be adversarial for tasks that require crisp edge inputs. In addition, we propose a novel refinement architecture to address the challenging problem of learning a crisp edge detector using ConvNet. Our method leverages a top-down backward refinement pathway, and progressively increases the resolution of feature maps to generate crisp edges. Our results achieve promising performance on BSDS500, surpassing human accuracy when using standard criteria, and largely outperforming state-of-the-art methods when using more strict criteria. We further demonstrate the benefit of crisp edge maps for estimating optical flow and generating object proposals.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"1724-1732"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91024824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Janai, F. Güney, Jonas Wulff, Michael J. Black, Andreas Geiger
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions across multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory, in natural environments. In addition, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel, challenging optical flow dataset by applying our technique to data from a high-speed camera, and analyze the performance of the state of the art in optical flow under various levels of motion blur.
{"title":"Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data","authors":"J. Janai, F. Güney, Jonas Wulff, Michael J. Black, Andreas Geiger","doi":"10.1109/CVPR.2017.154","DOIUrl":"https://doi.org/10.1109/CVPR.2017.154","url":null,"abstract":"Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"1406-1416"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75713724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long Short-Term Memory (LSTM) networks have shown superior performance in 3D human action recognition due to their power in modeling the dynamics and dependencies in sequential data. Since not all joints are informative for action analysis and irrelevant joints often introduce noise, we need to pay more attention to the informative ones. However, the original LSTM does not have a strong attention capability. Hence, we propose a new class of LSTM network, the Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in an action sequence with the assistance of global contextual information. To achieve a reliable attention representation for the action sequence, we further propose a recurrent attention mechanism for our GCA-LSTM network, in which the attention performance is improved iteratively. Experiments show that our end-to-end network can reliably focus on the most informative joints in each frame of the skeleton sequence. Moreover, our network yields state-of-the-art performance on three challenging datasets for 3D action recognition.
{"title":"Global Context-Aware Attention LSTM Networks for 3D Action Recognition","authors":"Jun Liu, G. Wang, Ping Hu, Ling-yu Duan, A. Kot","doi":"10.1109/CVPR.2017.391","DOIUrl":"https://doi.org/10.1109/CVPR.2017.391","url":null,"abstract":"Long Short-Term Memory (LSTM) networks have shown superior performance in 3D human action recognition due to their power in modeling the dynamics and dependencies in sequential data. Since not all joints are informative for action analysis and the irrelevant joints often bring a lot of noise, we need to pay more attention to the informative ones. However, original LSTM does not have strong attention capability. Hence we propose a new class of LSTM network, Global Context-Aware Attention LSTM (GCA-LSTM), for 3D action recognition, which is able to selectively focus on the informative joints in the action sequence with the assistance of global contextual information. In order to achieve a reliable attention representation for the action sequence, we further propose a recurrent attention mechanism for our GCA-LSTM network, in which the attention performance is improved iteratively. Experiments show that our end-to-end network can reliably focus on the most informative joints in each frame of the skeleton sequence. Moreover, our network yields state-of-the-art performance on three challenging datasets for 3D action recognition.","PeriodicalId":6631,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"4 5 1","pages":"3671-3680"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83453053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}