Rhythmer: Ranking-Based Skill Assessment With Rhythm-Aware Transformer
Authors: Zhuang Luo; Yang Xiao; Feng Yang; Joey Tianyi Zhou; Zhiwen Fang
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 259-272
Published: 2024-09-13
DOI: 10.1109/TCSVT.2024.3459938
URL: https://ieeexplore.ieee.org/document/10679980/
Impact factor: 11.1 (JCR Q1, Engineering, Electrical & Electronic)
Citation count: 0
Abstract
Ranking-based skill assessment is an essential component of video understanding. In this task, which lacks precise procedure annotations, existing methods emphasize evaluating procedure quality by manually normalizing the execution duration. However, this normalization alters the inherent duration-related procedural patterns. Experimentally, we discover that distinct duration biases are prevalent in duration-sensitive skills, such as those in medical and everyday-life settings. Hence, duration information is crucial for ranking-based skill assessment when dealing with varying durations. Additionally, similar execution processes tend to have closer execution durations. Thus, another critical factor lies in extracting duration-related procedural information from executions with similar durations. We define this as mining rhythm patterns, inspired by music rhythms, which comprise various durations and duration-related procedures. In our work, a rhythm-aware transformer is proposed to mine the rhythm patterns adaptively. Given pairwise inputs, a co-attention module is designed to mutually highlight duration-related procedure information when comparing input videos with similar durations, and to adaptively attenuate this effect when the paired inputs have significantly different durations. A rhythm-encoding module further embeds duration information into the concatenation of raw features and co-attention features. On top of these features, the transformer decoder is designed to learn duration-related queries supervised by a novel duration grouping loss among various duration groups. The experimental results demonstrate that the rhythm-aware transformer is effective for ranking-based skill assessment.
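The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Gaussian gate on the relative duration gap, a plain dot-product co-attention over clip features, and a standard margin ranking loss for the pairwise comparison. All function names and the `sigma` and `margin` parameters are hypothetical.

```python
import math

def duration_gate(d_a, d_b, sigma=0.25):
    """Weight in (0, 1]: 1.0 for equal durations, decaying as they diverge.
    Assumption: Gaussian on the relative gap; sigma is a made-up hyperparameter.
    Durations are assumed to be positive."""
    rel_gap = abs(d_a - d_b) / max(d_a, d_b)
    return math.exp(-(rel_gap ** 2) / (2 * sigma ** 2))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def co_attend(feats_a, feats_b, gate):
    """For each clip feature of video A, attend over video B's clips,
    scale the attended vector by the duration gate, and concatenate it
    with the raw feature (raw features first, co-attention features second)."""
    fused = []
    for q in feats_a:
        scale = math.sqrt(len(q))
        weights = softmax([dot(q, k) / scale for k in feats_b])
        attended = [
            gate * sum(w * k[i] for w, k in zip(weights, feats_b))
            for i in range(len(q))
        ]
        fused.append(q + attended)
    return fused

def margin_ranking_loss(score_better, score_worse, margin=1.0):
    """Hinge loss that is zero once the better video outscores the worse
    one by at least the margin."""
    return max(0.0, margin - (score_better - score_worse))

# Toy usage: two videos described by 2-D clip features.
video_a = [[1.0, 0.0], [0.0, 1.0]]
video_b = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
gate_close = duration_gate(60.0, 62.0)   # similar durations
gate_far = duration_gate(60.0, 180.0)    # very different durations
fused_close = co_attend(video_a, video_b, gate_close)
fused_far = co_attend(video_a, video_b, gate_far)
```

With this gating, the co-attention half of each fused feature shrinks toward zero as the two durations diverge, while the raw half is left unchanged, which mirrors the attenuation behavior the abstract describes in words.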
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.