Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos
Hehe Fan, Mohan S. Kankanhalli
ACM Multimedia Asia, 2021-12-01
DOI: 10.1145/3469877.3490582 (https://doi.org/10.1145/3469877.3490582)
Motion, according to its definition in physics, is the change in position with respect to time, regardless of the specific moving object and background. In this paper, we aim to learn an appearance-independent motion representation in an unsupervised manner. The main idea is to separate motion from videos while leaving objects and background as content. Specifically, we design an encoder-decoder model that consists of a content encoder, a motion encoder, and a video generator. To train the model, we use two self-supervised signals: a one-step cycle-consistency in reconstruction within the same video and a two-step cycle-consistency in generation across different videos. We also use adversarial training to remove content information from the motion representation. We demonstrate that the proposed framework can be used for conditional video generation and fine-grained action recognition.
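To make the two cycle-consistency signals concrete, the following is a minimal toy sketch and not the model from the paper. It assumes hand-written stand-ins for the three components: the content code is the per-pixel mean over time, the motion code is the video minus that mean, and the generator adds them back together. All function names (`content_encoder`, `motion_encoder`, `generator`, `one_step_cycle_loss`, `two_step_cycle_loss`) and the L1 loss are illustrative assumptions. The exact form of the two-step cycle is not given in the abstract, so the swap-and-swap-back reading used here is also an assumption. The adversarial term is omitted.

```python
from typing import List

Video = List[List[float]]  # T frames, each a flat list of pixel values


def content_encoder(video: Video) -> List[float]:
    """Toy content code (assumption): per-pixel mean over time."""
    t = len(video)
    return [sum(frame[i] for frame in video) / t for i in range(len(video[0]))]


def motion_encoder(video: Video) -> Video:
    """Toy motion code (assumption): what remains after removing content."""
    content = content_encoder(video)
    return [[p - c for p, c in zip(frame, content)] for frame in video]


def generator(content: List[float], motion: Video) -> Video:
    """Toy generator (assumption): add the content to every motion frame."""
    return [[c + m for c, m in zip(content, frame)] for frame in motion]


def l1_loss(a: Video, b: Video) -> float:
    """Mean absolute difference between two videos of the same shape."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb)) / n


def one_step_cycle_loss(video: Video) -> float:
    """Encode one video and reconstruct it from its own two codes."""
    recon = generator(content_encoder(video), motion_encoder(video))
    return l1_loss(recon, video)


def two_step_cycle_loss(v1: Video, v2: Video) -> float:
    """Swap motion between two videos, then swap it back to recover v1."""
    # Step 1: cross-generation with exchanged motion codes.
    v12 = generator(content_encoder(v1), motion_encoder(v2))
    v21 = generator(content_encoder(v2), motion_encoder(v1))
    # Step 2: re-encode the generated videos and regenerate v1.
    v1_back = generator(content_encoder(v12), motion_encoder(v21))
    return l1_loss(v1_back, v1)


# Two tiny videos with 3 frames of 2 pixels each (same shape is required).
v1 = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
v2 = [[5.0, 5.0], [5.0, 3.0], [5.0, 1.0]]
print(one_step_cycle_loss(v1))
print(two_step_cycle_loss(v1, v2))
```

In this toy setting both losses vanish by construction, because the motion code has zero mean over time. With learned encoders and a learned generator, these quantities would instead serve as training objectives to be minimized.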