From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos

IF 9.8 · CAS Region 2 (Computer Science) · Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · IEEE Transactions on Affective Computing · Publication date: 2024-09-03 · DOI: 10.1109/TAFFC.2024.3453443
Yin Chen;Jia Li;Shiguang Shan;Meng Wang;Richang Hong
{"title":"From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos","authors":"Yin Chen;Jia Li;Shiguang Shan;Meng Wang;Richang Hong","doi":"10.1109/TAFFC.2024.3453443","DOIUrl":null,"url":null,"abstract":"Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. Recognizing the potential in leveraging SFER knowledge for DFER, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs) only. Then, we obtain our video model (i.e., S2D), for DFER, by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector. And the TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model for videos. Notably, MCPs and TMAs only increase a fraction of trainable parameters (less than +10%) to the original image model. Moreover, we present a novel Emotion-Anchors (i.e., reference samples for each emotion category) based Self-Distillation Loss to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we have achieved a new state of the art.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 2","pages":"624-638"},"PeriodicalIF":9.8000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10663980/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and limited diversity in pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently achieves much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. Recognizing the potential of leveraging SFER knowledge for DFER, we introduce a novel Static-to-Dynamic model (S2D) that exploits existing SFER knowledge and the dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which incorporates only a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model for DFER (i.e., S2D) by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, while TMAs capture and model the relationships among dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than 10%) to the original image model. Moreover, we present a novel Self-Distillation Loss based on Emotion-Anchors (i.e., reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing our S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve a new state of the art.
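The abstract describes the architecture only at a high level. Below is a minimal, hedged sketch of how a landmark-aware image model could be extended with temporal adapters in the spirit of S2D; it is not the authors' released code, and the module names, dimensions, fusion strategy, and bottleneck sizes are assumptions made purely for illustration.

```python
# Minimal sketch (assumed design, not the paper's implementation): a frozen
# ViT-style backbone, a crude stand-in for the Multi-View Complementary
# Prompters (MCPs) that fuses landmark-aware features into the patch tokens,
# and lightweight Temporal-Modeling Adapters (TMAs) mixing information across frames.
import torch
import torch.nn as nn


class TemporalModelingAdapter(nn.Module):
    """Bottleneck adapter that mixes information along the time axis (assumed design)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, tokens, dim)
        bt, n, _ = x.shape
        b = bt // num_frames
        h = self.down(x)                                            # (B*T, N, c)
        c = h.shape[-1]
        h = h.view(b, num_frames, n, c).permute(0, 2, 3, 1)         # (B, N, c, T)
        h = self.temporal(h.reshape(b * n, c, num_frames))          # 1-D conv over time
        h = h.reshape(b, n, c, num_frames).permute(0, 3, 1, 2)      # (B, T, N, c)
        return x + self.up(h.reshape(bt, n, c))                     # residual connection


class S2DSketch(nn.Module):
    """Frozen transformer blocks + landmark fusion + per-block TMAs (illustrative only)."""

    def __init__(self, dim: int = 768, depth: int = 4, num_classes: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)
        )
        for p in self.blocks.parameters():          # backbone stays frozen, adapter-style tuning
            p.requires_grad = False
        self.landmark_proj = nn.Linear(dim, dim)    # crude stand-in for the MCP fusion
        self.adapters = nn.ModuleList(TemporalModelingAdapter(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens, landmark_feats, num_frames):
        # tokens, landmark_feats: (B*T, N, D) patch tokens and landmark-aware features
        x = tokens + self.landmark_proj(landmark_feats)
        for blk, tma in zip(self.blocks, self.adapters):
            x = tma(blk(x), num_frames)
        clip = x.mean(dim=1).view(-1, num_frames, x.shape[-1]).mean(dim=1)  # frame -> clip pooling
        return self.head(clip)


if __name__ == "__main__":
    B, T, N, D = 2, 8, 49, 768
    model = S2DSketch(dim=D)
    out = model(torch.randn(B * T, N, D), torch.randn(B * T, N, D), num_frames=T)
    print(out.shape)  # torch.Size([2, 7])
```

In this sketch only the landmark projection, the adapters, and the classification head are trainable, which is consistent with the abstract's claim that the added modules account for less than 10% of the image model's parameters; the exact parameter budget here is not calibrated to the paper.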
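The Emotion-Anchor based Self-Distillation Loss is likewise described in only one sentence. The sketch below is one plausible, hedged reading of that idea: similarities between a sample's embedding and per-class anchor embeddings (reference samples) define soft targets, which regularize the classifier alongside the usual cross-entropy. The temperature, the loss weighting, and how anchors are selected are assumptions, not the paper's actual recipe.

```python
# Hedged illustration of an emotion-anchor self-distillation loss; the exact
# formulation in the paper may differ.
import torch
import torch.nn.functional as F


def emotion_anchor_distillation_loss(
    logits: torch.Tensor,        # (B, C) classifier outputs
    embeddings: torch.Tensor,    # (B, D) sample features
    anchors: torch.Tensor,       # (C, D) one reference embedding per emotion class
    labels: torch.Tensor,        # (B,) possibly ambiguous hard labels
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    # Soft targets: cosine similarity to each class anchor, softened by temperature.
    sims = F.normalize(embeddings, dim=-1) @ F.normalize(anchors, dim=-1).t()  # (B, C)
    soft_targets = F.softmax(sims / temperature, dim=-1)
    # Distillation: KL divergence between softened predictions and anchor-derived targets.
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    B, C, D = 4, 7, 768
    loss = emotion_anchor_distillation_loss(
        torch.randn(B, C), torch.randn(B, D), torch.randn(C, D),
        torch.randint(0, C, (B,)),
    )
    print(loss.item())
```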
Source journal: IEEE Transactions on Affective Computing
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS
CiteScore: 15.00
Self-citation rate: 6.20%
Annual publication volume: 174
About the journal: The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. The journal also welcomes surveys of existing work that provide new perspectives on the historical and future directions of this field.
Latest articles from this journal:
Multi-Level Relation-Aware Knowledge Distillation With Hierarchical Fusion for Incomplete Multimodal Sentiment Analysis
UCSM-TG: Utterance, Conversation and Speaker-level Speech Emotion Tracking Model in Conversations Using Transformer-GRU
Strength in Numbers, Power in Subjectivity: Scalable Modeling of Individual Annotators for Emotion Recognition Within and Across Corpora
LPM-Aug: Latent Pathology-Informed Multimodal Augmentation for Generalized Cognitive Decline Detection Via Speech
MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation