Bidirectional temporal and frame-segment attention for sparse action segmentation of figure skating
Yanchao Liu, Xina Cheng, Yuan Li, Takeshi Ikenaga
Computer Vision and Image Understanding, published 2024-09-26
DOI: 10.1016/j.cviu.2024.104186
https://www.sciencedirect.com/science/article/pii/S1077314224002674
Abstract
Temporal action segmentation aims to understand human activities in long videos. Most prior work focuses on dense-frame actions and relies on strong correlations between frames. In figure skating, however, technical actions appear only sparsely in the video. This poses a new challenge: the large amount of redundant temporal information weakens frame correlation. To this end, we propose a Bidirectional Temporal and Frame-Segment Attention Module (FSAM). Specifically, we add a reverse-temporal input stream and enhance frame correlation by fusing bidirectional temporal features. In addition, the proposed FSAM contains a multi-stage segment-aware GCN and decoder interaction module, which aims to learn the correlation between segment features across time domains and to integrate embeddings between frame and segment representations. To evaluate our approach, we propose the Figure Skating Sparse Action Segmentation (FSSAS) dataset, which comprises 100 samples from Olympic figure skating final and semi-final competitions, featuring more than 50 different male and female athletes. Extensive experiments show that our method achieves an accuracy of 87.75 and an edit score of 90.18 on the FSSAS dataset.
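The abstract reports two metrics, accuracy and edit score, without defining them. The sketch below shows how these are commonly computed in temporal action segmentation work: frame-wise accuracy as the percentage of correctly labeled frames, and the edit score as a normalized Levenshtein distance over the sequence of segment labels. It assumes these common definitions; the paper's exact evaluation protocol may differ, and the function names and the labels `bg`, `jump`, and `spin` are hypothetical.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    return [label for label, _ in groupby(frame_labels)]


def frame_accuracy(pred, truth):
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    correct = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * correct / len(truth)


def edit_score(pred, truth):
    """Segmental edit score: 100 * (1 - normalized Levenshtein distance).

    Assumed definition (common in the literature, not confirmed by the source).
    """
    p, t = to_segments(pred), to_segments(truth)
    # prev[j] is the edit distance between the first i-1 segments of p
    # and the first j segments of t.
    prev = list(range(len(t) + 1))
    for i in range(1, len(p) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    longest = max(len(p), len(t))
    return 100.0 * (1.0 - prev[-1] / longest) if longest else 100.0


# Hypothetical sparse example: one short "jump" inside long background.
truth = ["bg"] * 4 + ["jump"] * 2 + ["bg"] * 4
pred = ["bg"] * 3 + ["jump"] * 3 + ["bg"] * 2 + ["spin"] * 2
print(frame_accuracy(pred, truth), edit_score(pred, truth))
```

Because the edit score compares segment order rather than individual frames, it penalizes spurious segments (such as the trailing `spin` above) even when they cover few frames, which is relevant when actions are sparse.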
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems