Multitask Transformer for Cross-Corpus Speech Emotion Recognition

IF 9.8 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · IEEE Transactions on Affective Computing · Pub Date: 2025-01-07 · DOI: 10.1109/TAFFC.2025.3526592
Chung-Soo Ahn;Rajib Rana;Carlos Busso;Jagath C. Rajapakse
{"title":"Multitask Transformer for Cross-Corpus Speech Emotion Recognition","authors":"Chung-Soo Ahn;Rajib Rana;Carlos Busso;Jagath C. Rajapakse","doi":"10.1109/TAFFC.2025.3526592","DOIUrl":null,"url":null,"abstract":"Deep learning has significantly advanced the field of Speech Emotion Recognition (SER), yet its efficacy in cross-corpus scenarios remains a challenge. To overcome this limitation, recent studies demonstrate the success of multitask learning, which uses auxiliary tasks to reduce difference between source and target dataset (or transfer knowledge from source to target datasets). Despite the efforts, the overall accuracy for cross-corpus SER is still relatively low and needs attention. To improve performance, we propose a multitask framework with SER as the primary task and contrastive learning and information maximization as auxiliary tasks. We design the auxiliary tasks innovatively to use the target data without emotional labels to develop a better understanding of the target data. The core of our multitask framework is a pre-trained transformer. While transformers have gained attention in SER, their application to cross-corpus scenarios is still limited. Multimodal approaches for cross-corpus scenario is substantially limited as well. We use text as the second modality, developing separate multitask transformers for audio and text and conduct decision-level fusion during inference. We use publicly available and widely used speech corpora, including the IEMOCAP, MSP-IMPROV and EMO-DB databases. 
The results demonstrate the benefits of the proposed approach, achieving improved performance on the benchmark databases in cross-corpus settings.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1581-1591"},"PeriodicalIF":9.8000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10830494/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Deep learning has significantly advanced the field of Speech Emotion Recognition (SER), yet its efficacy in cross-corpus scenarios remains a challenge. To overcome this limitation, recent studies demonstrate the success of multitask learning, which uses auxiliary tasks to reduce differences between the source and target datasets (or to transfer knowledge from the source to the target). Despite these efforts, the overall accuracy for cross-corpus SER is still relatively low and needs attention. To improve performance, we propose a multitask framework with SER as the primary task and contrastive learning and information maximization as auxiliary tasks. We design the auxiliary tasks innovatively to use the target data without emotion labels to develop a better understanding of the target data. The core of our multitask framework is a pre-trained transformer. While transformers have gained attention in SER, their application to cross-corpus scenarios is still limited, and multimodal approaches for cross-corpus scenarios are similarly scarce. We use text as the second modality, developing separate multitask transformers for audio and text and conducting decision-level fusion during inference. We use publicly available and widely used speech corpora, including the IEMOCAP, MSP-IMPROV and EMO-DB databases. The results demonstrate the benefits of the proposed approach, achieving improved performance on the benchmark databases in cross-corpus settings.
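The abstract names the auxiliary objective (information maximization on unlabeled target data) and the fusion rule (decision-level fusion of audio and text models) but not their exact forms. The sketch below shows one common reading: a standard information-maximization loss that rewards confident per-sample predictions while keeping the batch-level class distribution diverse, and posterior averaging for decision-level fusion. The function names, the NumPy formulation, and the equal 0.5 fusion weighting are illustrative assumptions, not the paper's implementation; the contrastive-learning task is not sketched here.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def information_maximization_loss(probs, eps=1e-12):
    """A common IM objective on unlabeled target predictions:
    minimize per-sample conditional entropy (confident predictions)
    while maximizing the entropy of the marginal class distribution
    (diverse predictions across the batch). Lower is better."""
    cond_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    marginal = probs.mean(axis=0)
    marg_entropy = -np.sum(marginal * np.log(marginal + eps))
    return cond_entropy - marg_entropy

def decision_level_fusion(audio_logits, text_logits):
    """Average the class posteriors of the two unimodal models
    and pick the argmax emotion per sample."""
    fused = 0.5 * (softmax(audio_logits) + softmax(text_logits))
    return fused.argmax(axis=1)
```

For example, with uniform predictions the two entropy terms cancel and the IM loss is zero; it turns negative as predictions become confident but stay class-balanced, which is the regime the objective encourages on unlabeled target data.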
Source Journal

IEEE Transactions on Affective Computing (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS)
CiteScore: 15.00
Self-citation rate: 6.20%
Articles per year: 174
Journal description: The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.
Latest articles in this journal

- MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation
- Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation
- Context-Aware Toxicity-Adaptive Sampling for Affective Language Generation
- SLAB: A Self-supervised Label Generation Framework to Reduce Annotation Overhead
- Data Distribution Evolution for Robust EEG Emotion Recognition with Limited Data Resource