Modeling Fine-Grained Relations in Dynamic Space-Time Graphs for Video-Based Facial Expression Recognition
Changqin Huang; Fan Jiang; Zhongmei Han; Xiaodi Huang; Shijin Wang; Yanlai Zhu; Yunliang Jiang; Bin Hu
IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1675-1692, published 17 January 2025. DOI: 10.1109/TAFFC.2025.3530973
Citations: 0
Abstract
Facial expressions in videos inherently mirror the dynamic nature of real-world facial events. Consequently, facial expression recognition (FER) should employ a dynamic graph-based representation to effectively capture the relational structure of facial expressions, rather than relying on conventional grid or sequence methods. However, existing graph-based approaches have limitations. Frame-level graph methods provide only a coarse representation of the facial graph across time and space, while landmark-based graph methods must introduce additional facial landmarks, which results in a static graph structure. To address these challenges, we propose spatial-temporal relation-aware dynamic graph convolutional networks (ST-RDGCN). This fine-grained relation modeling approach enables the dynamic modeling of evolving facial expressions in videos through dynamic space-time graphs, eliminating the need for facial landmarks. ST-RDGCN encompasses three graph construction paradigms: dynamic independent space graph, dynamic joint space-time graph, and dynamic cross space-time graph. Furthermore, we propose a relation-aware space-time graph convolution (RSTG-Conv) operator to learn informative spatiotemporal correlations in dynamic space-time graphs. In extensive experimental evaluations, our ST-RDGCN demonstrates state-of-the-art performance on five popular video-based FER datasets, achieving overall accuracy scores of 99.69%, 91.67%, 56.51%, 69.37%, and 49.03% on the CK+, Oulu-CASIA, AFEW, DFEW, and FERV39k datasets, respectively. In particular, our ST-RDGCN outperforms the current best method by 3.6% in UAR on the most challenging FERV39k dataset. Furthermore, our analysis reveals that the dynamic cross space-time graph scheme is the most effective of the three dynamic graph construction schemes.
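The abstract names the RSTG-Conv operator and three graph-construction paradigms but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of the joint space-time variant: all T×N per-frame nodes are flattened into one graph, and the adjacency is computed dynamically from feature similarity rather than from facial landmarks. The class name, tensor shapes, and the scaled dot-product relation scoring are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (NOT the authors' code) of a relation-aware graph
# convolution over a dynamic joint space-time graph. Shapes, naming,
# and dot-product relation scoring are illustrative assumptions.
import torch
import torch.nn as nn


class RelationAwareSpaceTimeConv(nn.Module):
    """One graph-convolution layer over T*N space-time nodes.

    Edge weights ("relations") are inferred from the node features
    themselves via scaled dot-product similarity, so no facial
    landmarks or fixed adjacency matrix are required.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.query = nn.Linear(in_dim, out_dim)
        self.key = nn.Linear(in_dim, out_dim)
        self.value = nn.Linear(in_dim, out_dim)
        self.scale = out_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, nodes per frame, channels.
        b, t, n, c = x.shape
        nodes = x.reshape(b, t * n, c)  # flatten into one space-time graph
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        # Dynamic adjacency: pairwise relation strength between all nodes.
        adj = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = adj @ v  # aggregate neighbor features weighted by relations
        return out.reshape(b, t, n, -1)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 16, 64)  # 2 clips, 8 frames, 16 regions, 64-d
    layer = RelationAwareSpaceTimeConv(64, 64)
    print(layer(feats).shape)  # torch.Size([2, 8, 16, 64])
```

Because the adjacency is recomputed from the features of each clip, the graph structure changes with the input, which is what makes this kind of construction "dynamic" and landmark-free in the sense the abstract describes; the paper's independent-space and cross space-time variants would restrict which node pairs may be connected.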
About the journal:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.