Ning Sun;Changwei You;Wenming Zheng;Jixin Liu;Lei Chai;Haian Sun
Title: Multimodal Sentimental Privileged Information Embedding for Improving Facial Expression Recognition
DOI: 10.1109/TAFFC.2024.3415625
Journal: IEEE Transactions on Affective Computing, vol. 16, no. 1, pp. 133-144
Publication date: 2024-06-18 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10561510/
Citations: 0
Abstract
Facial expression recognition (FER) has long been one of the key tasks in affective computing. Over the years, researchers have worked to improve FER performance by designing models with more powerful feature extraction, embedding attention mechanisms, and reconstructing missing information. Departing from these paradigms, we attempt to improve FER performance by using multimodal sentiment data, such as audio and text, as privileged information (PI) for facial images. To this end, a multimodal privileged information embedded facial expression recognition network (MPI-FER) is proposed in this paper. During the training phase, the model embeds the PI of multimodal data into FER by learning cross-modality translation between multimodal sentiment data. During the test phase, input images alone are sufficient for the model to accomplish the FER task. MPI-FER is a large-scale, heterogeneous deep neural network. To train this model effectively with limited training samples, we design a multi-stage training strategy of module-wise pre-training followed by end-to-end fine-tuning. In addition, a strategy of filling the multimodal sentiment quaternion is proposed for applying our method to facial expression databases consisting only of face images. We conducted extensive experiments to evaluate the proposed method on two multimodal sentiment analysis databases (CH-SIMS and CMU-MOSI) and two in-the-wild FER databases (RAF-DB and AffectNet). The results show that embedding multimodal sentiment data as privileged information into the image-based FER task can significantly improve FER accuracy. Furthermore, by using only images in the test phase, the proposed method achieves better multimodal sentiment analysis results than methods that fuse multimodal sentiment data.
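The privileged-information setup described in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy stand-in, not the paper's implementation: single linear layers replace the deep encoders, and all parameter names (`W_img`, `W_img2aud`, etc.) are illustrative. It shows the key asymmetry: translation heads toward audio and text features contribute only to the training loss, while inference consumes the image alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: image/audio/text feature sizes, shared feature size, classes.
D_IMG, D_AUD, D_TXT, D_FEAT, N_CLASSES = 64, 32, 48, 16, 7

# Linear stand-ins for the deep encoders and heads (hypothetical names).
W_img = rng.normal(scale=0.1, size=(D_IMG, D_FEAT))       # image encoder
W_cls = rng.normal(scale=0.1, size=(D_FEAT, N_CLASSES))   # expression classifier
W_img2aud = rng.normal(scale=0.1, size=(D_FEAT, D_AUD))   # cross-modality translation to audio PI
W_img2txt = rng.normal(scale=0.1, size=(D_FEAT, D_TXT))   # cross-modality translation to text PI

def train_loss(img, aud, txt, label_onehot):
    """Training phase: audio/text act as privileged information.
    Total loss = classification loss + translation losses toward the PI targets."""
    feat = img @ W_img
    logits = feat @ W_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    cls_loss = -np.log(probs[label_onehot.argmax()] + 1e-9)
    trans_loss = (np.mean((feat @ W_img2aud - aud) ** 2)
                  + np.mean((feat @ W_img2txt - txt) ** 2))
    return cls_loss + trans_loss

def predict(img):
    """Test phase: the image alone suffices; the PI translation heads are unused."""
    return int((img @ W_img @ W_cls).argmax())

img = rng.normal(size=D_IMG)
aud = rng.normal(size=D_AUD)
txt = rng.normal(size=D_TXT)
label = np.zeros(N_CLASSES)
label[3] = 1.0

loss = train_loss(img, aud, txt, label)   # needs all modalities
pred = predict(img)                        # needs only the image
```

In a real system the translation heads would be trained (e.g. module-wise pre-training followed by end-to-end fine-tuning, as the abstract describes) so that the image features are forced to carry the sentiment information present in the privileged modalities.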
Journal Introduction
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.