T. Phung, Thi Hong Thu Ma, Van Truong Nguyen, Duc-Quang Vu
Self-Supervised Learning for Action Recognition by Video Denoising
2021 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1-6. Published 2021-08-19.
DOI: 10.1109/RIVF51545.2021.9642129 (https://doi.org/10.1109/RIVF51545.2021.9642129)
Abstract
Deep learning is a data-hungry technique that is most effective when applied to large datasets. However, large-scale annotated datasets are not always available. An approach such as self-supervised learning, in which labels are generated automatically, is therefore essential, and it has become a new direction for state-of-the-art methods. In this paper, we introduce a new self-supervised method, video denoising. The method trains an autoencoder to restore the original videos. A second model, called the discriminator, evaluates the quality of the videos output by the autoencoder. By reconstructing videos, the autoencoder learns both the spatial and the temporal relations between video frames, which makes the downstream task easier to handle. In our experiments, the model transfers well to the action recognition task and outperforms state-of-the-art methods on the UCF-101 and HMDB-51 datasets.
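The idea that labels come for free in a denoising pretext task can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the noise model or the loss, so the Gaussian noise, the `sigma` value, the mean squared error, and all function names below are assumptions made for illustration. The autoencoder and the discriminator themselves are omitted; the sketch only shows how a (noisy input, clean target) pair can be built without any manual annotation.

```python
import random


def make_clip(frames=4, height=3, width=3, seed=0):
    """Build a tiny synthetic grayscale clip (frames x height x width) in [0, 1]."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]


def add_gaussian_noise(clip, sigma=0.1, seed=1):
    """Corrupt every pixel with zero-mean Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    return [[[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
             for row in frame]
            for frame in clip]


def mse(a, b):
    """Mean squared error over all pixels of two clips with the same shape."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for row_a, row_b in zip(frame_a, frame_b):
            for pa, pb in zip(row_a, row_b):
                total += (pa - pb) ** 2
                count += 1
    return total / count


def make_training_pair(clip, sigma=0.1, seed=1):
    """Return (input, target): the noisy clip is the input, the clean clip is the label."""
    return add_gaussian_noise(clip, sigma, seed), clip


clean = make_clip()
noisy, target = make_training_pair(clean)

# Reference points for a reconstruction loss (MSE is an assumption here):
identity_loss = mse(noisy, target)   # a "model" that simply copies its noisy input
perfect_loss = mse(target, target)   # a perfect denoiser
```

A trained denoiser would be expected to reach a reconstruction loss between `perfect_loss` and `identity_loss`. In the method described above, the discriminator adds a further quality signal on the reconstructed videos, which this sketch leaves out.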