EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models

Frontiers in Computer Science | IF 2.4 | Q3 (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS) | Published: 2024-04-09 | DOI: 10.3389/fcomp.2024.1304687
Minxiao Wang, Ning Yang
{"title":"EmoAsst:通过预训练视觉和声学模型上的文本引导迁移学习进行情感识别的助手","authors":"Minxiao Wang, Ning Yang","doi":"10.3389/fcomp.2024.1304687","DOIUrl":null,"url":null,"abstract":"Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale high-quality human emotion data and the phenomenon of unbalanced performance on different modalities of data challenge DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models like Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text- based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances the accuracy and offers great adaptability for real-world applications. The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video and audio-based Emotion Recognition and Classification (ERC).","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models\",\"authors\":\"Minxiao Wang, Ning Yang\",\"doi\":\"10.3389/fcomp.2024.1304687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale high-quality human emotion data and the phenomenon of unbalanced performance on different modalities of data challenge DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models like Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text- based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances the accuracy and offers great adaptability for real-world applications. 
The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video and audio-based Emotion Recognition and Classification (ERC).\",\"PeriodicalId\":52823,\"journal\":{\"name\":\"Frontiers in Computer Science\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-04-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in Computer Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/fcomp.2024.1304687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fcomp.2024.1304687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale high-quality human emotion data and the phenomenon of unbalanced performance on different modalities of data challenge DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models like Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text- based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances the accuracy and offers great adaptability for real-world applications. The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video and audio-based Emotion Recognition and Classification (ERC).
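The abstract describes using CLIP's text encoder with emotion prompts to guide audio-based emotion recognition on wav2vec 2.0. The sketch below illustrates, with off-the-shelf Hugging Face checkpoints, one way such text-guided audio classification could be wired up; it is not the authors' implementation, and the checkpoint names, prompt template, pooling, and projection head are assumptions made for illustration only.

```python
# Minimal sketch of text-guided audio emotion classification in the spirit of
# EmoAsst (NOT the authors' released code): emotion-name prompts are encoded with
# a frozen CLIP text encoder, wav2vec 2.0 audio embeddings are projected into the
# same space, and classification is cosine similarity against the prompt embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection, Wav2Vec2Model

EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]  # MELD label set


class TextGuidedAudioERC(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32",
                 w2v_name="facebook/wav2vec2-base-960h", temperature=0.07):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_name)
        self.audio_encoder = Wav2Vec2Model.from_pretrained(w2v_name)
        # Project wav2vec 2.0 features (768-d) into CLIP's text embedding space (512-d).
        self.audio_proj = nn.Linear(self.audio_encoder.config.hidden_size,
                                    self.text_encoder.config.projection_dim)
        self.temperature = temperature
        # Freeze the text encoder; only the audio side is fine-tuned in this sketch.
        for p in self.text_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def encode_prompts(self):
        # Hypothetical prompt template; the paper's actual emotion prompts may differ.
        # In practice these embeddings would be computed once and cached.
        prompts = [f"a person speaking with {e} emotion" for e in EMOTIONS]
        tokens = self.tokenizer(prompts, padding=True, return_tensors="pt")
        text_embeds = self.text_encoder(**tokens).text_embeds          # (7, 512)
        return F.normalize(text_embeds, dim=-1)

    def forward(self, waveforms):
        # waveforms: (batch, samples) of 16 kHz mono audio; real inputs should be
        # normalized with Wav2Vec2FeatureExtractor before being passed in.
        hidden = self.audio_encoder(waveforms).last_hidden_state       # (B, T, 768)
        audio_embeds = self.audio_proj(hidden.mean(dim=1))             # mean-pool over time
        audio_embeds = F.normalize(audio_embeds, dim=-1)               # (B, 512)
        logits = audio_embeds @ self.encode_prompts().T / self.temperature
        return logits                                                   # (B, 7)


model = TextGuidedAudioERC()
dummy_audio = torch.randn(2, 16000)         # two 1-second dummy clips
print(model(dummy_audio).shape)             # torch.Size([2, 7])
```

Training the audio branch with cross-entropy over these prompt-similarity logits (or an explicit supervised contrastive loss across a batch) is one form the text-based guidance described in the abstract could take; the visual fusion module and the joint few-shot classifier would operate analogously on CLIP image embeddings.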
Source journal: Frontiers in Computer Science (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)
CiteScore: 4.30 | Self-citation rate: 0.00% | Articles per year: 152 | Review time: 13 weeks