{"title":"用于视频表示学习的时域神经编码器","authors":"Hao Hu, Zhaowen Wang, Joon-Young Lee, Zhe L. Lin, Guo-Jun Qi","doi":"10.1109/CVPRW.2017.272","DOIUrl":null,"url":null,"abstract":"We address the challenge of learning good video representations by explicitly modeling the relationship between visual concepts in time space. We propose a novel Temporal Preserving Recurrent Neural Network (TPRNN) that extracts and encodes visual dynamics with frame-level features as input. The proposed network architecture captures temporal dynamics by keeping track of the ordinal relationship of co-occurring visual concepts, and constructs video representations with their temporal order patterns. The resultant video representations effectively encode temporal information of dynamic patterns, which makes them more discriminative to human actions performed with different sequences of action patterns. We evaluate the proposed model on several real video datasets, and the results show that it successfully outperforms the baseline models. In particular, we observe significant improvement on action classes that can only be distinguished by capturing the temporal orders of action patterns.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"7 1","pages":"2192-2199"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Temporal Domain Neural Encoder for Video Representation Learning\",\"authors\":\"Hao Hu, Zhaowen Wang, Joon-Young Lee, Zhe L. Lin, Guo-Jun Qi\",\"doi\":\"10.1109/CVPRW.2017.272\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We address the challenge of learning good video representations by explicitly modeling the relationship between visual concepts in time space. We propose a novel Temporal Preserving Recurrent Neural Network (TPRNN) that extracts and encodes visual dynamics with frame-level features as input. The proposed network architecture captures temporal dynamics by keeping track of the ordinal relationship of co-occurring visual concepts, and constructs video representations with their temporal order patterns. The resultant video representations effectively encode temporal information of dynamic patterns, which makes them more discriminative to human actions performed with different sequences of action patterns. We evaluate the proposed model on several real video datasets, and the results show that it successfully outperforms the baseline models. 
In particular, we observe significant improvement on action classes that can only be distinguished by capturing the temporal orders of action patterns.\",\"PeriodicalId\":6668,\"journal\":{\"name\":\"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)\",\"volume\":\"7 1\",\"pages\":\"2192-2199\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPRW.2017.272\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPRW.2017.272","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Temporal Domain Neural Encoder for Video Representation Learning
We address the challenge of learning good video representations by explicitly modeling the relationships between visual concepts in the temporal domain. We propose a novel Temporal Preserving Recurrent Neural Network (TPRNN) that extracts and encodes visual dynamics from frame-level features taken as input. The proposed network architecture captures temporal dynamics by keeping track of the ordinal relationships among co-occurring visual concepts, and constructs video representations from their temporal order patterns. The resulting video representations effectively encode the temporal information of dynamic patterns, which makes them more discriminative for human actions performed with different sequences of action patterns. We evaluate the proposed model on several real video datasets, and the results show that it outperforms the baseline models. In particular, we observe significant improvements on action classes that can only be distinguished by capturing the temporal order of action patterns.
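The abstract describes the pipeline only at a high level: per-frame features go into a recurrent network that tracks their temporal order, and the resulting video-level representation is used for action classification. As a rough illustration of that general idea, and not of the paper's actual TPRNN formulation (its gating and ordinal-tracking mechanism are not specified in the abstract), the sketch below encodes a sequence of frame-level features with a standard GRU in PyTorch; the feature dimension, hidden size, and 101-way class count are illustrative assumptions.

```python
# Minimal sketch of a recurrent video encoder over frame-level features.
# NOTE: this is a generic GRU baseline for illustration, not the TPRNN
# architecture proposed in the paper; all dimensions are assumed values.
import torch
import torch.nn as nn


class RecurrentVideoEncoder(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512,
                 num_classes: int = 101):
        super().__init__()
        # The recurrent cell consumes one frame-level feature per time step,
        # so the hidden state depends on the order in which frames arrive.
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), e.g. CNN features
        # extracted per frame and kept in their original temporal order.
        _, last_hidden = self.rnn(frame_feats)   # (1, batch, hidden_dim)
        video_repr = last_hidden.squeeze(0)      # video-level representation
        return self.classifier(video_repr)       # action-class logits


if __name__ == "__main__":
    # Toy usage: 2 videos, 16 frames each, 2048-d frame features.
    model = RecurrentVideoEncoder()
    feats = torch.randn(2, 16, 2048)
    logits = model(feats)
    print(logits.shape)  # torch.Size([2, 101])
```

Because the recurrent cell consumes frames strictly in sequence, shuffling the frame order changes the resulting representation; this order sensitivity is the property the abstract emphasizes for separating action classes that share the same sub-patterns but differ in their temporal order.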