Multitask Transformer for Cross-Corpus Speech Emotion Recognition
Chung-Soo Ahn; Rajib Rana; Carlos Busso; Jagath C. Rajapakse
IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1581-1591
DOI: 10.1109/TAFFC.2025.3526592
Published: 2025-01-07
URL: https://ieeexplore.ieee.org/document/10830494/
Impact Factor: 9.8 · JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Deep learning has significantly advanced the field of Speech Emotion Recognition (SER), yet its efficacy in cross-corpus scenarios remains a challenge. To overcome this limitation, recent studies demonstrate the success of multitask learning, which uses auxiliary tasks to reduce the differences between source and target datasets (or to transfer knowledge from source to target datasets). Despite these efforts, the overall accuracy for cross-corpus SER is still relatively low and needs attention. To improve performance, we propose a multitask framework with SER as the primary task and contrastive learning and information maximization as auxiliary tasks. We design the auxiliary tasks to exploit target data without emotion labels and thereby develop a better understanding of the target domain. The core of our multitask framework is a pre-trained transformer. While transformers have gained attention in SER, their application to cross-corpus scenarios is still limited, and multimodal approaches for the cross-corpus scenario are substantially limited as well. We use text as the second modality, developing separate multitask transformers for audio and text and conducting decision-level fusion during inference. We use publicly available and widely used speech corpora, including the IEMOCAP, MSP-IMPROV, and EMO-DB databases. The results demonstrate the benefits of the proposed approach, achieving improved performance on these benchmark databases in cross-corpus settings.
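The abstract outlines three training objectives and a fusion step. As a rough illustration only, the PyTorch sketch below shows one plausible way to combine them; the NT-Xent form of the contrastive loss, the entropy-based information-maximization term, the loss weights w_con and w_im, and the fusion weight alpha are assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # NT-Xent-style loss between two augmented views of unlabeled target
    # embeddings; matching pairs sit on the diagonal of the similarity matrix.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def information_maximization(logits):
    # Push each target prediction to be confident (low conditional entropy)
    # while keeping the batch-level class marginal diverse (high entropy).
    p = F.softmax(logits, dim=1)
    cond_entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    marginal = p.mean(dim=0)
    marg_entropy = -(marginal * torch.log(marginal + 1e-8)).sum()
    return cond_entropy - marg_entropy

def multitask_loss(src_logits, src_labels, tgt_z1, tgt_z2, tgt_logits,
                   w_con=1.0, w_im=1.0):
    # Primary SER loss on labeled source data plus the two auxiliary losses
    # on unlabeled target data (the weights w_con and w_im are hypothetical).
    ser = F.cross_entropy(src_logits, src_labels)
    con = contrastive_loss(tgt_z1, tgt_z2)
    im = information_maximization(tgt_logits)
    return ser + w_con * con + w_im * im

def fuse_predictions(audio_logits, text_logits, alpha=0.5):
    # Decision-level fusion at inference: blend class probabilities from the
    # audio and text multitask transformers (alpha is an assumed weight).
    return alpha * F.softmax(audio_logits, dim=1) + \
           (1 - alpha) * F.softmax(text_logits, dim=1)

Under this reading, each modality's transformer would be trained separately with the combined loss, with fusion applied only at inference, consistent with the decision-level fusion described in the abstract.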
Journal Overview
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. The journal also welcomes surveys of existing work that provide new perspectives on the historical and future directions of this field.