Facial Action Unit Representation based on Self-supervised Learning with Ensembled Priori Constraints

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2024-08-26. DOI: 10.1109/TIP.2024.3446250

Haifeng Chen, Peng Zhang, Chujia Guo, Ke Lu, Dongmei Jiang


Citations: 0

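Abstract

Facial action units (AUs) are a comprehensive set of atomic facial muscle movements used for understanding human expressions. With supervised learning, discriminative AU representations can be learned from the local patches where the AUs are located. However, accurate AU localization and characterization require extensive manual annotation, which limits AU recognition performance in realistic scenarios. In this study, we propose an end-to-end self-supervised AU representation learning model (SsupAU) that learns AU representations from unlabeled facial videos. Specifically, the input face is decomposed into six components using autoencoders: five photo-geometrically meaningful components, together with the AUs as a 2D flow field. By progressively constructing the canonical neutral face, the posed neutral face, and the posed expressional face, these components can be disentangled without supervision, so that the AU representations can be learned. To construct the canonical neutral face without manually labeled ground truth for emotion state or AU intensity, two assumptions based on prior knowledge are proposed: 1) identity consistency, which exploits the identical albedos and depths of different frames in a face video and helps to learn the camera color mode as an extra cue for recovering the canonical neutral face; and 2) average face, which enables the model to discover a "neutral facial expression" for the canonical neutral face and to decouple the AUs in representation learning. To the best of our knowledge, this is the first attempt to design a self-supervised AU representation learning method based on the definition of AUs. Extensive experiments on benchmark datasets demonstrate the superior performance of the proposed method compared with other state-of-the-art approaches, as well as a strong capability to decompose the input face into meaningful factors for its reconstruction. The code is available at https://github.com/Sunner4nwpu/SsupAU.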
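The two ideas in the abstract that lend themselves to a small numerical illustration are the identity-consistency prior (albedo and depth should agree across frames of one video) and the representation of AUs as a 2D flow field. The NumPy sketch below is only an illustration under stated assumptions and is not the authors' implementation, which is in the linked repository. The function names, the array layouts, the mean-squared-deviation form of the penalty, and the use of backward bilinear warping to apply a flow field are all hypothetical choices made here for clarity.

```python
import numpy as np


def identity_consistency_penalty(per_frame_maps):
    """Mean squared deviation of each frame's map from the per-video mean.

    per_frame_maps: array of shape (T, H, W), one predicted albedo or depth
    map per frame of the same video (hypothetical layout). The value is 0
    when all frames agree, which is what an identity-consistency prior asks.
    """
    maps = np.asarray(per_frame_maps, dtype=np.float64)
    video_mean = maps.mean(axis=0, keepdims=True)
    return float(np.mean((maps - video_mean) ** 2))


def warp_with_flow(image, flow):
    """Backward-warp a (H, W) image with a (H, W, 2) flow of (dx, dy) offsets.

    Output pixel (y, x) samples the input at (y + dy, x + dx) with bilinear
    interpolation; sample positions are clamped to the image border.
    Assumption: this is one common way to apply a 2D flow field, used here
    only to show how a flow could turn a neutral face into an expressive one.
    """
    image = np.asarray(image, dtype=np.float64)
    flow = np.asarray(flow, dtype=np.float64)
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)
    # Integer corners surrounding each sample position.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional parts act as the bilinear interpolation weights.
    wx = src_x - x0
    wy = src_y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```

With a zero flow field, `warp_with_flow` returns the input unchanged, which corresponds to a face with no AU activation in this toy setting. In a trainable model these operations would be written in a differentiable framework rather than plain NumPy.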


