Ning Sun;Changwei You;Wenming Zheng;Jixin Liu;Lei Chai;Haian Sun
Title: Multimodal Sentimental Privileged Information Embedding for Improving Facial Expression Recognition
DOI: 10.1109/TAFFC.2024.3415625
Journal: IEEE Transactions on Affective Computing, vol. 16, no. 1, pp. 133-144
Publication date: 2024-06-18 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10561510/
Citations: 0
Abstract
Facial expression recognition (FER) has long been one of the key tasks in affective computing. Over the years, researchers have worked to improve FER performance by designing models with more powerful feature extraction, embedding attention mechanisms, and reconstructing missing information. Departing from these paradigms, we attempt to improve FER performance by using multimodal sentiment data, such as audio and text, as privileged information (PI) for facial images. To this end, a multimodal privileged information embedded facial expression recognition network (MPI-FER) is proposed in this paper. During the training phase, the model embeds the PI of multimodal data into FER by learning cross-modality translation between multimodal sentiment data. During the test phase, input images alone are sufficient for the model to accomplish the FER task. MPI-FER is a large-scale, heterogeneous deep neural network. To train this model effectively with limited training samples, we design a multi-stage training strategy of module-wise pre-training followed by end-to-end fine-tuning. In addition, a strategy of filling the multimodal sentiment quaternion is proposed for applying our method to facial expression databases consisting only of face images. We conducted extensive experiments to evaluate the proposed method on two multimodal sentiment analysis databases (CH-SIMS and CMU-MOSI) and two in-the-wild FER databases (RAF-DB and AffectNet). The results show that embedding multimodal sentiment data as privileged information into the image-based FER task can significantly improve FER accuracy. Furthermore, by using only images in the test phase, the proposed method achieves better multimodal sentiment analysis results than methods that fuse multimodal sentiment data.
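The privileged-information setup described in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy stand-in, not the paper's implementation: single linear layers replace the deep encoders, and all parameter names (`W_img`, `W_img2aud`, etc.) are illustrative. It shows the key asymmetry: translation heads toward audio and text features contribute only to the training loss, while inference consumes the image alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: image/audio/text feature sizes, shared feature size, classes.
D_IMG, D_AUD, D_TXT, D_FEAT, N_CLASSES = 64, 32, 48, 16, 7

# Linear stand-ins for the deep encoders and heads (hypothetical names).
W_img = rng.normal(scale=0.1, size=(D_IMG, D_FEAT))       # image encoder
W_cls = rng.normal(scale=0.1, size=(D_FEAT, N_CLASSES))   # expression classifier
W_img2aud = rng.normal(scale=0.1, size=(D_FEAT, D_AUD))   # cross-modality translation to audio PI
W_img2txt = rng.normal(scale=0.1, size=(D_FEAT, D_TXT))   # cross-modality translation to text PI

def train_loss(img, aud, txt, label_onehot):
    """Training phase: audio/text act as privileged information.
    Total loss = classification loss + translation losses toward the PI targets."""
    feat = img @ W_img
    logits = feat @ W_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    cls_loss = -np.log(probs[label_onehot.argmax()] + 1e-9)
    trans_loss = (np.mean((feat @ W_img2aud - aud) ** 2)
                  + np.mean((feat @ W_img2txt - txt) ** 2))
    return cls_loss + trans_loss

def predict(img):
    """Test phase: the image alone suffices; the PI translation heads are unused."""
    return int((img @ W_img @ W_cls).argmax())

img = rng.normal(size=D_IMG)
aud = rng.normal(size=D_AUD)
txt = rng.normal(size=D_TXT)
label = np.zeros(N_CLASSES)
label[3] = 1.0

loss = train_loss(img, aud, txt, label)   # needs all modalities
pred = predict(img)                        # needs only the image
```

In a real system the translation heads would be trained (e.g. module-wise pre-training followed by end-to-end fine-tuning, as the abstract describes) so that the image features are forced to carry the sentiment information present in the privileged modalities.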
Journal Introduction
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.