Feature distillation from vision-language model for semisupervised action classification
Authors: ASLI ÇELİK, AYHAN KÜÇÜKMANİSA, OĞUZHAN URHAN
Journal: Turkish Journal of Electrical Engineering and Computer Sciences (impact factor 1.2; JCR Q4, Computer Science, Artificial Intelligence)
Publication date: 2023-10-07
Publication type: Journal Article
DOI: https://doi.org/10.55730/1300-0632.4038
Citation count: 0
Abstract
Training supervised machine learning models depends critically on annotating large-scale datasets. Semisupervised learning approaches aim to achieve performance comparable to supervised methods with considerably less annotation, without sacrificing generalization capacity. In line with this objective, ways of leveraging unlabeled data have been the subject of intense research. However, semisupervised video action recognition has received less attention than its image-domain counterparts. Existing semisupervised video action recognition methods that are trained from scratch rely heavily on augmentation techniques, complex architectures, and/or additional modalities, while distillation-based methods use teacher models that have been trained only for 2D computer vision tasks. In another line of work, pretrained vision-language models have shown very promising results in generating general-purpose visual features, with reports of high zero-shot performance on many downstream tasks. In this work, we exploit a language-supervised visual encoder to learn video representations for video action classification. We propose a teacher-student learning paradigm based on feature distillation and pseudo-labeling. Our experimental results are a proof of concept showing that multimodal feature extractors can be used for spatiotemporal feature extraction in a semisupervised learning context, with performance comparable to state-of-the-art (SOTA) methods, especially in the low-label regime.
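The abstract names two ingredients, feature distillation and pseudo-labeling, without giving their exact form. The following is a minimal sketch of how such terms are commonly computed, not the authors' implementation. Everything specific in it is an assumption made for illustration: the cosine-distance distillation loss, mean pooling of per-frame teacher features, the 0.9 confidence threshold, and all function names.

```python
import math

def mean_pool(frame_features):
    """Average per-frame teacher features into one clip-level vector (assumed pooling)."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]

def cosine_distance(u, v):
    """Distillation term (assumed form): 1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(logits, threshold=0.9):
    """Return the predicted class index if its probability reaches the
    (hypothetical) threshold, otherwise None so the clip is skipped."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None
    return probs.index(confidence)

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

# Toy unlabeled clip: two frames of teacher features, one student clip feature.
teacher_clip = mean_pool([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])
student_clip = [0.9, 1.1, 1.0]
student_logits = [6.0, 0.5, 0.2]

loss = cosine_distance(student_clip, teacher_clip)
label = pseudo_label(student_logits)
if label is not None:
    # Only confident predictions contribute a classification term.
    loss += cross_entropy(student_logits, label)
```

In this sketch the distillation term applies to every clip, labeled or not, while the classification term applies only when a ground-truth label or a sufficiently confident pseudo-label is available. How the paper weights and schedules the two terms is not stated in the abstract.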
About the journal:
The Turkish Journal of Electrical Engineering & Computer Sciences is published electronically 6 times a year by the Scientific and Technological Research Council of Turkey (TÜBİTAK).
It accepts English-language manuscripts in the areas of power and energy, environmental sustainability and energy efficiency, electronics, industry applications, control systems, information and systems, applied electromagnetics, communications, signal and image processing, tomographic image reconstruction, face recognition, biometrics, speech processing, video processing and analysis, object recognition, classification, feature extraction, parallel and distributed computing, cognitive systems, interaction, robotics, digital libraries and content, personalized healthcare, ICT for mobility, sensors, and artificial intelligence.
Contribution is open to researchers of all nationalities.