From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos

IEEE Transactions on Affective Computing. Impact factor: 9.8. JCR Q1 (Computer Science, Artificial Intelligence); Region 2 (Computer Science). Publication date: 2024-09-03. DOI: 10.1109/TAFFC.2024.3453443

Yin Chen;Jia Li;Shiguang Shan;Meng Wang;Richang Hong

Volume 16, issue 2, pages 624-638. Publication type: Journal Article. Full text: https://ieeexplore.ieee.org/document/10663980/


Abstract

Dynamic facial expression recognition (DFER) in the wild remains hindered by data limitations, e.g., insufficient quantity and limited diversity of pose, occlusion, and illumination, as well as by the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. Recognizing the potential of SFER knowledge for DFER, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and the dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which consists only of a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model for DFER (i.e., S2D) by inserting Temporal-Modeling Adapters (TMAs) into the image model. The MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector. The TMAs capture and model the relationships among dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, the MCPs and TMAs add only a small fraction of trainable parameters (less than 10%) to the original image model. Moreover, we present a novel Self-Distillation Loss based on Emotion-Anchors (i.e., reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing S2D. Experiments conducted on popular SFER and DFER datasets show that S2D achieves a new state of the art.
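The abstract does not give the form of the Emotion-Anchors based Self-Distillation Loss, so the following is only an illustrative sketch of one way such a loss could be built, not the authors' formulation. It assumes that a soft label is derived from the mean cosine similarity between a sample's feature and a few anchor features per class, and that the loss mixes cross-entropy on the hard label with a KL term toward that soft label. All function names, the temperature, and the mixing weight `alpha` are hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def anchor_soft_target(feature, anchors, temperature=0.1):
    """Assumed soft label: softmax over the mean similarity to each class's anchors.

    anchors is a list indexed by class; each entry is a non-empty list of
    reference feature vectors for that emotion category.
    """
    scores = [sum(cosine(feature, a) for a in group) / len(group)
              for group in anchors]
    return softmax(scores, temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def anchor_distillation_loss(logits, label, soft_target, alpha=0.5):
    """Hypothetical mix of hard-label cross-entropy and anchor-based KL term."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label] + 1e-12)
    distill = kl_divergence(soft_target, probs)
    return (1.0 - alpha) * cross_entropy + alpha * distill

# Toy example: two emotion classes, two anchors each, 2-D features.
anchors = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
feature = [0.8, 0.2]
logits = [2.0, 0.5]
target = anchor_soft_target(feature, anchors)
loss = anchor_distillation_loss(logits, 0, target)
```

In this sketch, a sample whose feature lies between two classes receives a softer target, so an ambiguous hard label contributes less to the gradient than it would under cross-entropy alone.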




Source journal

IEEE Transactions on Affective Computing (Computer Science, Artificial Intelligence; Computer Science, Cybernetics)

CiteScore: 15.00

Self-citation rate: 6.20%

Articles published: 174

About the journal: The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. It also welcomes surveys of existing work that provide new perspectives on the historical and future directions of this field.