SimFLE: Simple Facial Landmark Encoding for Self-Supervised Facial Expression Recognition in the Wild

Jiyong Moon; Hyeryung Jang; Seongsik Park

DOI: 10.1109/TAFFC.2024.3470980

IEEE Transactions on Affective Computing, vol. 16, no. 2, pp. 799-813. Published: 2024-09-30.
Citations: 0
Abstract
Facial expression recognition in the wild (FER-W) entails classifying facial emotions in natural environments. The major challenges in FER-W stem from the complexity and ambiguity of facial images, which make it difficult to curate a large-scale labeled dataset for training. Additionally, the subtle differences between emotions often reside in the fine-grained details of local facial landmarks, demanding innovative solutions to capture these crucial features efficiently. To address these issues, we employ two distinct self-supervised methods. First, we adopt a contrastive learning method to capture generalized global representations, enabling the model to understand the semantic context of facial expressions without relying on labeled data. Simultaneously, we leverage masked image modeling to embed fine-grained, local facial landmark information at the patch level. We introduce a novel module called FaceMAE, which aims to reconstruct the masked facial patches. The semantic masking scheme is designed to preserve highly activated features, allowing the encoding of crucial details of unmasked facial landmarks and their relationships within the broader facial context at the patch level. It finally guides the backbone network to calibrate the learned global features to be attentive to facial landmarks. Our proposed method, called Simple Facial Landmark Encoding (SimFLE), significantly outperforms the supervised baseline and other self-supervised methods in terms of facial landmark localization and overall performance, as demonstrated through extensive experiments across several FER-W benchmarks.
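The abstract describes a semantic masking scheme that keeps highly activated patches visible while masking the rest for reconstruction. A minimal sketch of one plausible reading of that idea is below; the function name `semantic_mask`, the `keep_ratio` parameter, and the use of raw activation magnitudes are illustrative assumptions, since the abstract does not specify the paper's actual selection rule.

```python
import numpy as np

def semantic_mask(patch_activations, keep_ratio=0.25):
    """Return a boolean mask over image patches (True = masked for reconstruction).

    Hypothetical reading of a semantic masking scheme: the most highly
    activated patches (e.g., salient facial-landmark regions) are kept
    visible, and the remaining patches are masked so a FaceMAE-style
    decoder must reconstruct them.
    """
    n = patch_activations.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    # Indices of the most highly activated patches, sorted descending.
    keep_idx = np.argsort(patch_activations)[::-1][:n_keep]
    mask = np.ones(n, dtype=bool)   # start with everything masked
    mask[keep_idx] = False          # unmask the top-activated patches
    return mask

# Toy example: 8 patches whose activations peak at landmark-like regions.
acts = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.15])
mask = semantic_mask(acts, keep_ratio=0.25)  # keeps the 2 strongest patches
```

With `keep_ratio=0.25`, the two strongest patches (indices 1 and 3) remain visible and the other six are masked, so reconstruction pressure falls on the context around the preserved landmark patches.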
About the Journal
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.