Takamasa Tsunoda, Y. Komori, M. Matsugu, T. Harada
We present a hierarchical recurrent network for understanding team sports activity in image and location sequences. In the hierarchical model, we integrate multiple proposed person-centered features over a temporal sequence based on the LSTM's outputs. To achieve this scheme, we introduce a Keeping state in the LSTM as an externally controllable state, and extend hierarchical LSTMs to include a mechanism for this integration. Experimental results demonstrate the effectiveness of the proposed framework involving hierarchical LSTMs and person-centered features. In this study, we demonstrate improvement over the reference model. Specifically, by incorporating the person-centered features with meta-information (e.g., location data) in our proposed late-fusion framework, we also demonstrate increased discriminability of action categories and enhanced robustness against fluctuations in the number of observed players.
{"title":"Football Action Recognition Using Hierarchical LSTM","authors":"Takamasa Tsunoda, Y. Komori, M. Matsugu, T. Harada","doi":"10.1109/CVPRW.2017.25","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.25","url":null,"abstract":"We present a hierarchical recurrent network for understanding team sports activity in image and location sequences. In the hierarchical model, we integrate proposed multiple person-centered features over a temporal sequence based on LSTM's outputs. To achieve this scheme, we introduce the Keeping state in LSTM as one of externally controllable states, and extend the Hierarchical LSTMs to include mechanism for the integration. Experimental results demonstrate effectiveness of the proposed framework involving hierarchical LSTM and person-centered feature. In this study, we demonstrate improvement over the reference model. Specifically, by incorporating the person-centered feature with meta-information (e.g., location data) in our proposed late fusion framework, we also demonstrate increased discriminability of action categories and enhanced robustness against fluctuation in the number of observed players.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"30 4 1","pages":"155-163"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90532333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anupam Das, Martin Degeling, Xiaoyou Wang, Junjue Wang, N. Sadeh, M. Satyanarayanan
Computer vision based technologies have seen widespread adoption in recent years. This use is not limited to the rapid adoption of facial recognition technology but extends to facial expression recognition, scene recognition, and more. These developments raise privacy concerns and call for novel solutions to ensure adequate user awareness and, ideally, control over the resulting collection and use of potentially sensitive data. While cameras have become ubiquitous, most of the time users are not even aware of their presence. In this paper, we introduce a novel distributed privacy infrastructure for the Internet of Things and discuss in particular how it can help enhance users' awareness of, and control over, the collection and use of video data about them. The infrastructure, which has undergone early deployment and evaluation on two campuses, supports the automated discovery of IoT resources and the selective notification of users. This includes the presence of computer vision applications that collect data about users. In particular, we describe an implementation of functionality that helps users discover nearby cameras and choose whether or not they want their faces to be denatured in the video streams.
{"title":"Assisting Users in a World Full of Cameras: A Privacy-Aware Infrastructure for Computer Vision Applications","authors":"Anupam Das, Martin Degeling, Xiaoyou Wang, Junjue Wang, N. Sadeh, M. Satyanarayanan","doi":"10.1109/CVPRW.2017.181","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.181","url":null,"abstract":"Computer vision based technologies have seen widespread adoption over the recent years. This use is not limited to the rapid adoption of facial recognition technology but extends to facial expression recognition, scene recognition and more. These developments raise privacy concerns and call for novel solutions to ensure adequate user awareness, and ideally, control over the resulting collection and use of potentially sensitive data. While cameras have become ubiquitous, most of the time users are not even aware of their presence. In this paper we introduce a novel distributed privacy infrastructure for the Internet-of-Things and discuss in particular how it can help enhance user's awareness of and control over the collection and use of video data about them. The infrastructure, which has undergone early deployment and evaluation on two campuses, supports the automated discovery of IoT resources and the selective notification of users. This includes the presence of computer vision applications that collect data about users. In particular, we describe an implementation of functionality that helps users discover nearby cameras and choose whether or not they want their faces to be denatured in the video streams.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"80 8","pages":"1387-1396"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91501523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soonchan Park, Ju Yong Chang, Hyuk Jeong, Jae-Ho Lee, Jiyoung Park
Human pose analysis is known to be an effective means of evaluating an athlete's performance. Marker-less 3D human pose estimation is one of the most practical methods for acquiring human pose, but it lacks the accuracy required for precise performance analysis in sports. In this paper, we propose a human pose estimation algorithm that utilizes multiple types of random forests to enhance results for sports analysis. Random regression forest voting to localize joints of the athlete's anatomy is followed by random verification forests that evaluate and optimize the votes, improving the accuracy of the clustering that determines the final positions of the anatomic joints. Experimental results show that the proposed algorithm enhances not only the accuracy but also the efficiency of human pose estimation. We also conduct a field study to investigate the feasibility of the algorithm for sports applications with the golf swing analysis system we developed.
{"title":"Accurate and Efficient 3D Human Pose Estimation Algorithm Using Single Depth Images for Pose Analysis in Golf","authors":"Soonchan Park, Ju Yong Chang, Hyuk Jeong, Jae-Ho Lee, Jiyoung Park","doi":"10.1109/CVPRW.2017.19","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.19","url":null,"abstract":"Human pose analysis has been known to be an effective means to evaluate athlete's performance. Marker-less 3D human pose estimation is one of the most practical methods to acquire human pose but lacks sufficient accuracy required to achieve precise performance analysis for sports. In this paper, we propose a human pose estimation algorithm that utilizes multiple types of random forests to enhance results for sports analysis. Random regression forest voting to localize joints of the athlete's anatomy is followed by random verification forests that evaluate and optimize the votes to improve the accuracy of clustering that determine the final position of anatomic joints. Experiential results show that the proposed algorithm enhances not only accuracy, but also efficiency of human pose estimation. We also conduct the field study to investigate feasibility of the algorithm for sports applications with developed golf swing analyzing system.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"64 4 1","pages":"105-113"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91552094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tomer Weiss, Masaki Nakada, Demetri Terzopoulos
Recent work in computer graphics has explored the synthesis of indoor spaces with furniture, accessories, and other layout items. In this work, we bridge the gap between the physical and virtual worlds: given an input image of an interior or exterior space, and a general user specification of the desired furnishings and layout constraints, our method automatically furnishes the scene with a realistic arrangement and displays it to the user by augmenting the original image. Our method can deal with varying layouts and target arrangements at interactive rates, which affords the user a sense of collaboration with the design program, enabling the rapid visual assessment of various layout designs, a process that would typically be time-consuming if done manually. Our method is suitable for smartphones and other camera-enabled mobile devices.
{"title":"Automated Layout Synthesis and Visualization from Images of Interior or Exterior Spaces","authors":"Tomer Weiss, Masaki Nakada, Demetri Terzopoulos","doi":"10.1109/CVPRW.2017.12","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.12","url":null,"abstract":"Recent work in computer graphics has explored the synthesis of indoor spaces with furniture, accessories, and other layout items. In this work, we bridge the gap between the physical and virtual worlds: Given an input image of an interior or exterior space, and a general user specification of the desired furnishings and layout constraints, our method automatically furnishes the scene with a realistic arrangement and displays it to the user by augmenting the original image. Our method can deal with varying layouts and target arrangements at interactive rates, which affords the user a sense of collaboration with the design program, enabling the rapid visual assessment of various layout designs, a process which would typically be time consuming if done manually. Our method is suitable for smartphones and other camera-enabled mobile devices.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"4 1","pages":"41-47"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90092076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nina Narodytska, S. Kasiviswanathan
Deep neural networks are powerful and popular learning models that achieve state-of-the-art pattern recognition performance on many computer vision, speech, and language processing tasks. However, these networks have also been shown to be susceptible to crafted adversarial perturbations which force misclassification of the inputs. Adversarial examples enable adversaries to subvert the expected system behavior, leading to undesired consequences, and could pose a security risk when these systems are deployed in the real world. In this work, we focus on deep convolutional neural networks and demonstrate that adversaries can easily craft adversarial examples even without any internal knowledge of the target network. Our attacks treat the network as an oracle (black-box) and only assume that the output of the network can be observed on the probed inputs. Our attacks utilize a novel local-search-based technique to construct a numerical approximation to the network gradient, which is then carefully used to select a small set of pixels in an image to perturb. We demonstrate how this underlying idea can be adapted to achieve several strong notions of misclassification. The simplicity and effectiveness of our proposed schemes mean that they could serve as a litmus test for designing robust networks.
{"title":"Simple Black-Box Adversarial Attacks on Deep Neural Networks","authors":"Nina Narodytska, S. Kasiviswanathan","doi":"10.1109/CVPRW.2017.172","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.172","url":null,"abstract":"Deep neural networks are powerful and popular learning models that achieve state-of-the-art pattern recognition performance on many computer vision, speech, and language processing tasks. However, these networks have also been shown susceptible to crafted adversarial perturbations which force misclassification of the inputs. Adversarial examples enable adversaries to subvert the expected system behavior leading to undesired consequences and could pose a security risk when these systems are deployed in the real world.,,,,,, In this work, we focus on deep convolutional neural networks and demonstrate that adversaries can easily craft adversarial examples even without any internal knowledge of the target network. Our attacks treat the network as an oracle (black-box) and only assume that the output of the network can be observed on the probed inputs. Our attacks utilize a novel local-search based technique to construct numerical approximation to the network gradient, which is then carefully used to construct a small set of pixels in an image to perturb. We demonstrate how this underlying idea can be adapted to achieve several strong notions of misclassification. The simplicity and effectiveness of our proposed schemes mean that they could serve as a litmus test for designing robust networks.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"115 1","pages":"1310-1318"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79527701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Alonso-Fernandez, R. Farrugia, J. Bigün
Iris recognition research is heading towards enabling more relaxed acquisition conditions. This has effects on the quality and resolution of acquired images, severely affecting the accuracy of recognition systems if not tackled appropriately. In this paper, we evaluate a super-resolution algorithm that reconstructs iris images based on iterative neighbor embedding of local image patches, which aims to represent input low-resolution patches while preserving the geometry of the original high-resolution space. To this end, the geometries of the low- and high-resolution manifolds are jointly considered during the reconstruction process. We validate the system with a database of 1,872 near-infrared iris images, with fusion of two iris comparators adopted to improve recognition performance. The presented approach is substantially superior to bilinear/bicubic interpolation at very low resolutions, and it also outperforms a previous PCA-based iris reconstruction approach which considers only the geometry of the low-resolution manifold during the reconstruction process.
{"title":"Iris Super-Resolution Using Iterative Neighbor Embedding","authors":"F. Alonso-Fernandez, R. Farrugia, J. Bigün","doi":"10.1109/CVPRW.2017.94","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.94","url":null,"abstract":"Iris recognition research is heading towards enabling more relaxed acquisition conditions. This has effects on the quality and resolution of acquired images, severely affecting the accuracy of recognition systems if not tackled appropriately. In this paper, we evaluate a super-resolution algorithm used to reconstruct iris images based on iterative neighbor embedding of local image patches which tries to represent input low-resolution patches while preserving the geometry of the original high-resolution space. To this end, the geometry of the low-and high-resolution manifolds are jointly considered during the reconstruction process. We validate the system with a database of 1,872 near-infrared iris images, while fusion of two iris comparators has been adopted to improve recognition performance. The presented approach is substantially superior to bilinear/bicubic interpolations at very low resolutions, and it also outperforms a previous PCA-based iris reconstruction approach which only considers the geometry of the low-resolution manifold during the reconstruction process.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"1 1","pages":"655-663"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74889042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michele Merler, D. Joshi, Q. Nguyen, Stephen Hammer, John Kent, John R. Smith, R. Feris
The production of sports highlight packages summarizing a game’s most exciting moments is an essential task for broadcast media. Yet, it requires labor-intensive video editing. We propose a novel approach for auto-curating sports highlights, and use it to create a real-world system for the editorial aid of golf highlight reels. Our method fuses information from the players’ reactions (action recognition such as high-fives and fist pumps), spectators (crowd cheering), and commentator (tone of the voice and word analysis) to determine the most interesting moments of a game. We accurately identify the start and end frames of key shot highlights with additional metadata, such as the player’s name and the hole number, allowing personalized content summarization and retrieval. In addition, we introduce new techniques for learning our classifiers with reduced manual training data annotation by exploiting the correlation of different modalities. Our work has been demonstrated at a major golf tournament, successfully extracting highlights from live video streams over four consecutive days.
{"title":"Automatic Curation of Golf Highlights Using Multimodal Excitement Features","authors":"Michele Merler, D. Joshi, Q. Nguyen, Stephen Hammer, John Kent, John R. Smith, R. Feris","doi":"10.1109/CVPRW.2017.14","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.14","url":null,"abstract":"The production of sports highlight packages summarizing a game’s most exciting moments is an essential task for broadcast media. Yet, it requires labor-intensive video editing. We propose a novel approach for auto-curating sports highlights, and use it to create a real-world system for the editorial aid of golf highlight reels. Our method fuses information from the players’ reactions (action recognition such as high-fives and fist pumps), spectators (crowd cheering), and commentator (tone of the voice and word analysis) to determine the most interesting moments of a game. We accurately identify the start and end frames of key shot highlights with additional metadata, such as the player’s name and the hole number, allowing personalized content summarization and retrieval. In addition, we introduce new techniques for learning our classifiers with reduced manual training data annotation by exploiting the correlation of different modalities. Our work has been demonstrated at a major golf tournament, successfully extracting highlights from live video streams over four consecutive days.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"142 1","pages":"57-65"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88959889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petrissa Zell, Bastian Wandt, B. Rosenhahn
Motion analysis is often restricted to a laboratory setup with multiple cameras and force sensors, which requires expensive equipment and knowledgeable operators and therefore lacks simplicity and flexibility. We propose an algorithm combining monocular 3D pose estimation with physics-based modeling to introduce a statistical framework for fast and robust 3D motion analysis from 2D video data. We use a factorization approach to learn 3D motion coefficients and join them with physical parameters that describe the dynamics of a mass-spring model. Our approach requires neither additional force measurements nor torque optimization, uses only a single camera, and still allows estimation of unobservable torques in the human body. We show that our algorithm improves monocular 3D reconstruction by enforcing plausible human motion and resolving the ambiguity of camera and object motion. The performance is evaluated on different motions and multiple test data sets, as well as on challenging outdoor sequences.
{"title":"Joint 3D Human Motion Capture and Physical Analysis from Monocular Videos","authors":"Petrissa Zell, Bastian Wandt, B. Rosenhahn","doi":"10.1109/CVPRW.2017.9","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.9","url":null,"abstract":"Motion analysis is often restricted to a laboratory setup with multiple cameras and force sensors which requires expensive equipment and knowledgeable operators. Therefore it lacks in simplicity and flexibility. We propose an algorithm combining monocular 3D pose estimation with physics-based modeling to introduce a statistical framework for fast and robust 3D motion analysis from 2D video-data. We use a factorization approach to learn 3D motion coefficients and join them with physical parameters, that describe the dynamic of a mass-spring-model. Our approach does neither require additional force measurement nor torque optimization and only uses a single camera while allowing to estimate unobservable torques in the human body. We show that our algorithm improves the monocular 3D reconstruction by enforcing plausible human motion and resolving the ambiguity of camera and object motion.,,,,,,The performance is evaluated on different motions and multiple test data sets as well as on challenging outdoor sequences.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"30 1","pages":"17-26"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84382527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peng Zhou, Xintong Han, Vlad I. Morariu, L. Davis
We propose a two-stream network for face tampering detection. We train GoogLeNet to detect tampering artifacts in a face classification stream, and train a patch-based triplet network to leverage features capturing local noise residuals and camera characteristics as a second stream. In addition, we use two different online face swapping applications to create a new dataset that consists of 2,010 tampered images, each of which contains a tampered face. We evaluate the proposed two-stream network on our newly collected dataset. Experimental results demonstrate the effectiveness of our method.
{"title":"Two-Stream Neural Networks for Tampered Face Detection","authors":"Peng Zhou, Xintong Han, Vlad I. Morariu, L. Davis","doi":"10.1109/CVPRW.2017.229","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.229","url":null,"abstract":"We propose a two-stream network for face tampering detection. We train GoogLeNet to detect tampering artifacts in a face classification stream, and train a patch based triplet network to leverage features capturing local noise residuals and camera characteristics as a second stream. In addition, we use two different online face swaping applications to create a new dataset that consists of 2010 tampered images, each of which contains a tampered face. We evaluate the proposed two-stream network on our newly collected dataset. Experimental results demonstrate the effectness of our method.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"66 1","pages":"1831-1839"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89285009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Chen, Hua Yang, Ji Zhu, Qin Zhou, Shuang Wu, Zhiyong Gao
In this paper, we propose a novel deep end-to-end network to automatically learn the spatial-temporal fusion features for video-based person re-identification. Specifically, the proposed network consists of a CNN and an RNN that jointly learn both the spatial and the temporal features of input image sequences. The network is optimized using the Siamese and softmax losses simultaneously, pulling the instances of the same person closer while pushing the instances of different persons apart. Our network is trained on full-body and part-body image sequences respectively to learn complementary representations from holistic and local perspectives. By combining them, we obtain more discriminative features that are beneficial to person re-identification. Experiments conducted on the PRID-2011, iLIDS-VID, and MARS datasets show that the proposed method performs favorably against existing approaches.
{"title":"Deep Spatial-Temporal Fusion Network for Video-Based Person Re-identification","authors":"Lin Chen, Hua Yang, Ji Zhu, Qin Zhou, Shuang Wu, Zhiyong Gao","doi":"10.1109/CVPRW.2017.191","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.191","url":null,"abstract":"In this paper, we propose a novel deep end-to-end network to automatically learn the spatial-temporal fusion features for video-based person re-identification. Specifically, the proposed network consists of CNN and RNN to jointly learn both the spatial and the temporal features of input image sequences. The network is optimized by utilizing the siamese and softmax losses simultaneously to pull the instances of the same person closer and push the instances of different persons apart. Our network is trained on full-body and part-body image sequences respectively to learn complementary representations from holistic and local perspectives. By combining them together, we obtain more discriminative features that are beneficial to person re-identification. Experiments conducted on the PRID-2011, i-LIDS-VIS and MARS datasets show that the proposed method performs favorably against existing approaches.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"9 1","pages":"1478-1485"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87911439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}