In recent years, deep learning has driven significant progress in facial expression recognition (FER). However, existing models still face challenges in computational efficiency and generalization when handling diverse emotional expressions and complex environmental variations. Recently, large-scale vision-language pre-training models such as CLIP have achieved remarkable success in multi-modal learning, and their rich visual and textual representations offer valuable guidance for downstream tasks. Transferring this knowledge to build efficient and accurate FER systems has therefore emerged as a key research direction. To this end, this paper proposes a novel model, termed Knowledge Distillation and Retrieval-Augmented Generation (KDRAG), which combines knowledge distillation and retrieval-augmented generation (RAG) to improve the efficiency and accuracy of FER. Through knowledge distillation, the teacher model (ViT-L/14) transfers its rich knowledge to a smaller student model (ViT-B/32); an additional linear projection layer maps the teacher model's output features to the student model's feature dimension for feature alignment. Moreover, a RAG mechanism enhances the student model's emotional understanding by retrieving text descriptions related to the input image. The framework further combines a soft loss (derived from the teacher model's knowledge) with a hard loss (derived from the ground-truth labels) to improve generalization. Extensive experiments on multiple datasets demonstrate that KDRAG achieves significant improvements in accuracy and computational efficiency, offering new insights for real-time FER systems.
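The distillation objective and feature alignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions (768 for the ViT-L/14 teacher, 512 for the ViT-B/32 student), the temperature `T`, and the weighting `alpha` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical feature alignment: a linear projection mapping the
# teacher's 768-d output features into the student's 512-d space.
rng = np.random.default_rng(0)
W_proj = rng.normal(scale=0.02, size=(768, 512))

def align_teacher_features(teacher_feats):
    # teacher_feats: (batch, 768) -> (batch, 512)
    return teacher_feats @ W_proj

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combined distillation objective (sketch).

    soft loss: KL(teacher_T || student_T), softened by temperature T
               and scaled by T^2 (standard KD practice);
    hard loss: cross-entropy against the ground-truth labels.
    """
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    soft = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean() * T * T
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard
```

When the student matches the teacher exactly, the soft (KL) term vanishes and only the hard cross-entropy term remains; mismatched predictions are penalized by both terms.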