Authors: Quan-Qi Chen, Feng Liu, Xue Li, Baodi Liu, Yujin Zhang
Published in: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3076-3080, 2016-09-01
DOI: 10.1109/ICIP.2016.7532925 (https://doi.org/10.1109/ICIP.2016.7532925)
Saliency-context two-stream convnets for action recognition
Recently, very deep two-stream ConvNets have achieved great discriminative power for video classification, especially the temporal ConvNets when trained on multi-frame optical flow. However, action recognition in videos often falls prey to wild camera motion, which makes it difficult to extract reliable optical flow for the human body. In light of this, we propose a novel method to remove the global camera motion, which explicitly calculates a homography between two consecutive frames without human detection. Given the estimated homography due to camera motion, background motion can be canceled out, yielding the warped optical flow. We take this a step further and design a new architecture called Saliency-Context two-stream ConvNets, where the context two-stream ConvNets are employed to recognize the entire scene in video frames, whilst the saliency streams are trained on salient human motion regions that are detected from the warped optical flow. Finally, the Saliency-Context two-stream ConvNets allow us to capture complementary information and achieve state-of-the-art performance on the UCF101 dataset.
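The camera-motion cancellation described in the abstract can be illustrated with a minimal sketch. It assumes the homography H between two consecutive frames is already available as a 3x3 matrix (the abstract does not say how it is estimated) and that the dense optical flow is stored as a grid of (dx, dy) tuples. The function names, the magnitude threshold, and the bounding-box step for picking a salient region are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 homography H (nested lists)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def residual_flow(flow, H):
    """Subtract the camera-induced flow from a dense flow field.

    flow[y][x] is a (dx, dy) tuple. The flow that the camera alone would
    produce at (x, y) is H(x, y) - (x, y); removing it leaves the motion
    that the homography does not explain.
    """
    out = []
    for y, row in enumerate(flow):
        out_row = []
        for x, (dx, dy) in enumerate(row):
            u, v = apply_homography(H, x, y)
            out_row.append((dx - (u - x), dy - (v - y)))
        out.append(out_row)
    return out


def salient_box(flow, threshold):
    """Bounding box (x0, y0, x1, y1) of pixels whose flow magnitude
    exceeds threshold, or None if no pixel does.

    Assumption: a simple magnitude threshold stands in for whatever
    saliency detection the paper actually uses.
    """
    points = [
        (x, y)
        for y, row in enumerate(flow)
        for x, (dx, dy) in enumerate(row)
        if math.hypot(dx, dy) > threshold
    ]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    # Toy example: the camera pans so that the background shifts by 2 px in x.
    H = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    flow = [[(2.0, 0.0) for _ in range(5)] for _ in range(4)]
    # Two pixels belong to a moving person and carry extra motion.
    flow[1][3] = (5.0, 1.0)
    flow[2][3] = (5.0, 1.0)
    warped = residual_flow(flow, H)
    print(salient_box(warped, threshold=1.0))
```

In this toy setup the background flow cancels exactly, so only the two pixels with independent motion survive the threshold. With real frames the homography would be estimated from the images and the residual background flow would be small but nonzero, which is why a threshold (or a more robust saliency measure) is needed.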