CNN-n-GRU: end-to-end speech emotion recognition from raw waveform signal using CNNs and gated recurrent unit networks

Alaa Nfissi, W. Bouachir, N. Bouguila, B. Mishara
2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), December 2022. DOI: 10.1109/ICMLA55696.2022.00116

Abstract

We present CNN-n-GRU, a new end-to-end (E2E) architecture composed of an n-layer convolutional neural network (CNN) followed sequentially by an n-layer Gated Recurrent Unit (GRU) network for speech emotion recognition. CNNs and RNNs have both exhibited promising results when fed raw waveform voice inputs, which inspired us to combine them into a single model to maximise their potential. Instead of using handcrafted features or spectrograms, we train the CNN to learn low-level speech representations from the raw waveform, allowing the network to capture relevant narrow-band emotion characteristics. The RNN layers (GRUs in our case) then learn temporal characteristics, allowing the network to better capture the signal's time-distributed features. Because a CNN can generate multiple levels of representation abstraction, we exploit its early layers to extract high-level features and supply the appropriate input to the subsequent RNN layers, which aggregate long-term dependencies. By taking advantage of both CNNs and GRUs in a single model, the proposed architecture has important advantages over other models from the literature. The proposed model was evaluated on the TESS dataset and compared to state-of-the-art methods. Our experimental results demonstrate that the proposed model is more accurate than traditional classification approaches for speech emotion recognition.
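The abstract describes the architecture only at a high level: n convolutional layers applied directly to the raw waveform, followed by n GRU layers, with the final hidden state used for classification. A minimal NumPy sketch of one possible forward pass for n = 2 might look like the following; the kernel sizes, strides, hidden widths, and weight initialisation are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w, stride):
    """Valid 1-D convolution with ReLU. x: (in_ch, T), w: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    out = np.empty((out_ch, t_out))
    for t in range(t_out):
        seg = x[:, t * stride : t * stride + k]              # (in_ch, k) window
        out[:, t] = np.tensordot(w, seg, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

def gru_layer(x, p):
    """One GRU layer over a full sequence. x: (T, d_in) -> hidden states (T, d_h)."""
    Wz, Uz, Wr, Ur, Wh, Uh = p
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    h, hs = np.zeros(Uz.shape[0]), []
    for x_t in x:
        z = sig(Wz @ x_t + Uz @ h)                   # update gate
        r = sig(Wr @ x_t + Ur @ h)                   # reset gate
        h_cand = np.tanh(Wh @ x_t + Uh @ (r * h))    # candidate state
        h = (1.0 - z) * h + z * h_cand
        hs.append(h)
    return np.stack(hs)

def gru_params(d_in, d_h):
    """Random (hypothetical) weights for one GRU layer."""
    return tuple(0.1 * rng.standard_normal(shape)
                 for shape in [(d_h, d_in), (d_h, d_h)] * 3)

# --- Hypothetical CNN-2-GRU instance (n = 2); all sizes are illustrative ---
wave = rng.standard_normal((1, 16000))               # 1 s of raw waveform at 16 kHz
w1 = 0.1 * rng.standard_normal((8, 1, 64))           # CNN layer 1: wide kernel, stride 8
w2 = 0.1 * rng.standard_normal((16, 8, 32))          # CNN layer 2: narrower kernel
feats = conv1d_relu(conv1d_relu(wave, w1, 8), w2, 4).T   # time-major features (T', 16)

h_seq = gru_layer(feats, gru_params(16, 32))         # GRU layer 1: full sequence out
h_last = gru_layer(h_seq, gru_params(32, 32))[-1]    # GRU layer 2: keep final state

W_out = 0.1 * rng.standard_normal((7, 32))           # 7 emotion classes, as in TESS
logits = W_out @ h_last
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax over emotion classes
print(probs.shape)
```

With untrained random weights the output is of course meaningless; the point is the data flow the abstract describes: the strided convolutions compress 16,000 raw samples into a short sequence of learned feature vectors, which the stacked GRUs then summarise into a single state for classification.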