Human-Centric Fine-Grained Action Quality Assessment
Jinglin Xu; Sibo Yin; Yuxin Peng
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 8, pp. 6242-6255
Published: 2025-04-01
DOI: 10.1109/TPAMI.2025.3556935
https://ieeexplore.ieee.org/document/10946879/
Citations: 0
Abstract
Existing action quality assessment (AQA) methods mainly learn deep representations at the video level to score diverse actions. Lacking a fine-grained understanding of the actions in videos, they suffer from low credibility and accuracy and are thus insufficient for stringent applications such as competitive sports and sports injury rehabilitation. We argue that a fine-grained understanding of actions requires the model to parse actions in semantics, time, and space, which is the key to the credibility and accuracy of AQA techniques. Based on this insight, we propose a new human-centric fine-grained action quality assessment method, a Unified Fine-grained spatial-temporal action Parser named Uni-FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in semantics, time, and space, minimizing the impact of irrelevant backgrounds during assessment. In addition, we construct human-centric foreground action mask annotations for the FineDiving, AQA-7, and MTL-AQA datasets, called FineDiving-HM, AQA-7-HM, and MTL-AQA-HM, respectively. With refined spatio-temporal annotations on diverse target action procedures, Uni-FineParser offers the potential for human-centric fine-grained action quality assessment with better interpretability. Through extensive experiments, we demonstrate the effectiveness of Uni-FineParser, which outperforms state-of-the-art methods while supporting a broader range of human-centric action understanding tasks.
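The abstract does not specify implementation details, but its core mechanism is suppressing background regions with a human foreground mask before scoring. A minimal sketch of that idea, using a hypothetical masked average pooling over per-frame features (the function name, shapes, and toy data are illustrative assumptions, not the paper's actual architecture), might look like:

```python
import numpy as np

def foreground_pool(frame_features, human_mask):
    """Masked average pooling (illustrative sketch, not the paper's method):
    aggregate only features inside the human foreground region so the
    background contributes nothing to the pooled descriptor."""
    # frame_features: (H, W, C) float array; human_mask: (H, W) binary array
    mask = human_mask.astype(frame_features.dtype)[..., None]  # (H, W, 1)
    masked = frame_features * mask                # zero out background cells
    denom = mask.sum()                            # number of foreground cells
    return masked.sum(axis=(0, 1)) / max(denom, 1.0)

# Toy example: 4x4 frame, 2 channels; foreground occupies the top-left 2x2.
feats = np.ones((4, 4, 2))
feats[:2, :2, :] = 3.0                            # foreground feature values
mask = np.zeros((4, 4))
mask[:2, :2] = 1
print(foreground_pool(feats, mask))               # → [3. 3.]
```

Plain average pooling over the same frame would yield 1.5 per channel; masking recovers the pure foreground value of 3.0, which is the intuition behind reducing the impact of irrelevant backgrounds during assessment.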