Modeling Fine-Grained Relations in Dynamic Space-Time Graphs for Video-Based Facial Expression Recognition
Changqin Huang; Fan Jiang; Zhongmei Han; Xiaodi Huang; Shijin Wang; Yanlai Zhu; Yunliang Jiang; Bin Hu
IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1675-1692, published 17 January 2025. DOI: 10.1109/TAFFC.2025.3530973
Citations: 0
Abstract
Facial expressions in videos inherently mirror the dynamic nature of real-world facial events. Consequently, facial expression recognition (FER) should employ a dynamic graph-based representation to effectively capture the relational structure of facial expressions, rather than relying on conventional grid or sequence methods. However, existing graph-based approaches have limitations. Frame-level graph methods provide only a coarse representation of the facial graph across time and space, while landmark-based graph methods must introduce additional facial landmarks, which results in a static graph structure. To address these challenges, we propose spatial-temporal relation-aware dynamic graph convolutional networks (ST-RDGCN). This fine-grained relation modeling approach enables the dynamic modeling of evolving facial expressions in videos through dynamic space-time graphs, eliminating the need for facial landmarks. ST-RDGCN encompasses three graph construction paradigms: dynamic independent space graph, dynamic joint space-time graph, and dynamic cross space-time graph. Furthermore, we propose a relation-aware space-time graph convolution (RSTG-Conv) operator to learn informative spatiotemporal correlations in dynamic space-time graphs. In extensive experimental evaluations, our ST-RDGCN demonstrates state-of-the-art performance on five popular video-based FER datasets, achieving overall accuracy scores of 99.69%, 91.67%, 56.51%, 69.37%, and 49.03% on the CK+, Oulu-CASIA, AFEW, DFEW, and FERV39k datasets, respectively. In particular, our ST-RDGCN outperforms the current best method by 3.6% in UAR on the most challenging FERV39k dataset. Furthermore, our analysis reveals that the dynamic cross space-time graph scheme is the most effective of the three dynamic graph construction schemes.
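The abstract names the RSTG-Conv operator and three graph-construction paradigms but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of the joint space-time variant: all T×N per-frame nodes are flattened into one graph, and the adjacency is computed dynamically from feature similarity rather than from facial landmarks. The class name, tensor shapes, and the scaled dot-product relation scoring are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (NOT the authors' code) of a relation-aware graph
# convolution over a dynamic joint space-time graph. Shapes, naming,
# and dot-product relation scoring are illustrative assumptions.
import torch
import torch.nn as nn


class RelationAwareSpaceTimeConv(nn.Module):
    """One graph-convolution layer over T*N space-time nodes.

    Edge weights ("relations") are inferred from the node features
    themselves via scaled dot-product similarity, so no facial
    landmarks or fixed adjacency matrix are required.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.query = nn.Linear(in_dim, out_dim)
        self.key = nn.Linear(in_dim, out_dim)
        self.value = nn.Linear(in_dim, out_dim)
        self.scale = out_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, nodes per frame, channels.
        b, t, n, c = x.shape
        nodes = x.reshape(b, t * n, c)  # flatten into one space-time graph
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        # Dynamic adjacency: pairwise relation strength between all nodes.
        adj = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = adj @ v  # aggregate neighbor features weighted by relations
        return out.reshape(b, t, n, -1)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 16, 64)  # 2 clips, 8 frames, 16 regions, 64-d
    layer = RelationAwareSpaceTimeConv(64, 64)
    print(layer(feats).shape)  # torch.Size([2, 8, 16, 64])
```

Because the adjacency is recomputed from the features of each clip, the graph structure changes with the input, which is what makes this kind of construction "dynamic" and landmark-free in the sense the abstract describes; the paper's independent-space and cross space-time variants would restrict which node pairs may be connected.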
About the journal:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.