SkeletonCLIP: Recognizing Skeleton-based Human Actions with Text Prompts
Lin Yuan, Zhen He, Qianqian Wang, Leiyang Xu, Xiang Ma
2022 8th International Conference on Systems and Informatics (ICSAI), 2022-12-10
DOI: 10.1109/ICSAI57119.2022.10005459
Citations: 0
Abstract
Human action recognition has been a hot research topic for decades, and mainstream supervised frameworks consist of a feature-extraction backbone and a softmax classifier that predicts daily human actions. When the number of classes in the dataset changes, the classifier must be retrained on top of the well-trained backbone. This pipeline restricts the generalization and transfer ability of the model because of the extra training period it requires. Moreover, replacing action labels with plain numeric labels discards useful semantic information and ultimately yields a semantically meaningless classifier. In this work, we present SkeletonCLIP, a model for skeleton-based human action recognition. We add a text encoder to extract semantic information from the labels while keeping the original sequence encoder. We use the dot product to measure the similarity of sequence-text pairs in place of the traditional classifier head and cross-entropy loss. Experiments on three human action datasets show that our framework reaches higher recognition accuracy with the help of semantic information when training the network from scratch. The code is available at eunseo-v/SkeletonCLIP.
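To make the sequence-text matching idea concrete, the sketch below shows a minimal CLIP-style similarity computation between encoded skeleton sequences and encoded label prompts, where the dot product over normalized embeddings stands in for a softmax classifier head. This is an illustrative assumption of the general scheme described in the abstract, not the paper's implementation: the class and member names (SkeletonTextMatcher, seq_encoder, text_encoder, logit_scale), the linear placeholder encoders, and the feature dimensions are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonTextMatcher(nn.Module):
    """Hypothetical sketch of CLIP-style matching between skeleton-sequence
    features and label-text features. The linear layers are placeholders for
    the paper's sequence encoder and text encoder."""

    def __init__(self, seq_dim=256, text_dim=512, embed_dim=128):
        super().__init__()
        self.seq_encoder = nn.Linear(seq_dim, embed_dim)    # stands in for the skeleton backbone
        self.text_encoder = nn.Linear(text_dim, embed_dim)  # stands in for the label-text encoder
        self.logit_scale = nn.Parameter(torch.tensor(1.0))  # learnable temperature, as in CLIP

    def forward(self, seq_feats, text_feats):
        # L2-normalize both embeddings so the dot product is a cosine similarity
        z_seq = F.normalize(self.seq_encoder(seq_feats), dim=-1)    # (B, embed_dim)
        z_txt = F.normalize(self.text_encoder(text_feats), dim=-1)  # (C, embed_dim)
        # pairwise sequence-text similarities replace the fixed classifier head
        return self.logit_scale.exp() * z_seq @ z_txt.t()           # (B, C)

# Usage with random placeholder features: prediction is the class whose text
# embedding is most similar to the sequence embedding, so adding or removing
# classes only changes the set of text prompts, not the trained backbone.
model = SkeletonTextMatcher()
seq_feats = torch.randn(4, 256)    # hypothetical pooled skeleton-sequence features
text_feats = torch.randn(10, 512)  # hypothetical encoded label prompts for 10 classes
pred = model(seq_feats, text_feats).argmax(dim=-1)
```

The abstract does not spell out the training objective beyond replacing the classifier head and cross-entropy loss with sequence-text similarities, so the sketch stops at computing the similarity matrix.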