Representation Learning with Spectro-Temporal-Channel Attention for Speech Emotion Recognition

Lili Guo, Longbiao Wang, Chenglin Xu, J. Dang, Chng Eng Siong, Haizhou Li
{"title":"Representation Learning with Spectro-Temporal-Channel Attention for Speech Emotion Recognition","authors":"Lili Guo, Longbiao Wang, Chenglin Xu, J. Dang, Chng Eng Siong, Haizhou Li","doi":"10.1109/ICASSP39728.2021.9414006","DOIUrl":null,"url":null,"abstract":"Convolutional neural network (CNN) is found to be effective in learning representation for speech emotion recognition. CNNs do not explicitly model the associations or relative importance of features in the spectral/temporal/channel-wise axes. In this paper, we propose an attention module, named spectro-temporal-channel (STC) attention module that is integrated with CNN to improve representation learning ability. Our module infers an attention map along the three dimensions, namely time, frequency, and CNN channel. Experiments are conducted on the IEMOCAP database to evaluate the effectiveness of the proposed representation learning method. The results demonstrate that the proposed method outperforms the traditional CNN method by an absolute increase of 3.13% in terms of F1 score.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP39728.2021.9414006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

Convolutional neural networks (CNNs) have been found effective for learning representations for speech emotion recognition. However, CNNs do not explicitly model the associations or relative importance of features along the spectral, temporal, and channel axes. In this paper, we propose an attention module, named the spectro-temporal-channel (STC) attention module, that is integrated with a CNN to improve its representation learning ability. Our module infers an attention map along three dimensions, namely time, frequency, and CNN channel. Experiments are conducted on the IEMOCAP database to evaluate the effectiveness of the proposed representation learning method. The results demonstrate that the proposed method outperforms the traditional CNN method by an absolute increase of 3.13% in F1 score.
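To make the idea concrete, the following is a minimal NumPy sketch of attention applied along the three axes of a CNN feature map (channel, frequency, time). It is an illustration only, not the authors' implementation: the paper infers the attention maps with learned parameters, whereas this sketch substitutes parameter-free mean pooling followed by a sigmoid, and the function name `stc_attention` and all shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stc_attention(feat):
    """Toy spectro-temporal-channel attention (illustrative only).

    feat: CNN feature map of shape (channels, freq, time).
    Returns the feature map reweighted by attention maps inferred
    along the channel, frequency, and time axes.
    """
    # Channel attention: pool over (freq, time) -> shape (c, 1, 1).
    a_c = sigmoid(feat.mean(axis=(1, 2)) - feat.mean())[:, None, None]
    # Spectral attention: pool over (channel, time) -> shape (1, f, 1).
    a_f = sigmoid(feat.mean(axis=(0, 2)) - feat.mean())[None, :, None]
    # Temporal attention: pool over (channel, freq) -> shape (1, 1, t).
    a_t = sigmoid(feat.mean(axis=(0, 1)) - feat.mean())[None, None, :]
    # Broadcasting combines the three maps into one 3-D attention map.
    return feat * a_c * a_f * a_t

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 40, 100))  # e.g. 8 channels, 40 mel bins, 100 frames
y = stc_attention(x)
```

Because each attention map lies in (0, 1), the module can only attenuate features, never amplify them; selectivity comes from attenuating uninformative regions less than informative ones.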