Scene Retrieval in Soccer Videos by Spatial-temporal Attention with Video Vision Transformer
Yaozong Gan, Ren Togo, Takahiro Ogawa, M. Haseyama
2022 IEEE International Conference on Consumer Electronics - Taiwan, 2022-07-06
DOI: 10.1109/ICCE-Taiwan55306.2022.9869188
Abstract
This paper presents a scene retrieval method for soccer videos using the video vision Transformer (ViViT). In soccer coaching, it is difficult for training staff to efficiently find the scenes they need in a large collection of soccer videos. We tackle this problem with a simple yet effective method. We train ViViT and extract output token features of soccer scenes from the pre-trained model; these tokens contain spatio-temporal information about the scenes. We then transform a query scene and candidate scenes into output token features using the pre-trained ViViT and measure the similarity between the tokens with cosine similarity. We conducted experiments on the SoccerNet-V2 dataset. The experimental results show that the proposed method achieves higher retrieval accuracy than previous methods.
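The retrieval step described above (ranking candidate scenes by cosine similarity between their ViViT token features and the query's) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the use of plain NumPy vectors as stand-ins for ViViT output token features are assumptions for the sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(query_feat: np.ndarray, candidate_feats: list) -> list:
    """Return candidate indices ordered by similarity to the query, best first.

    query_feat / candidate_feats stand in for token features that a
    pre-trained ViViT model would produce for each scene clip.
    """
    scores = [cosine_similarity(query_feat, c) for c in candidate_feats]
    return sorted(range(len(candidate_feats)), key=lambda i: scores[i], reverse=True)

# Toy usage: candidate 0 matches the query exactly, candidate 2 is close,
# candidate 1 is orthogonal (dissimilar).
query = np.array([1.0, 0.0])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]
ranking = rank_candidates(query, candidates)
print(ranking)  # → [0, 2, 1]
```

In the paper's setting, each feature vector would instead be the output token representation of a scene clip produced by the pre-trained ViViT, but the ranking logic is the same.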