SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition

IF 9.8 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE IEEE Transactions on Affective Computing Pub Date : 2025-02-13 DOI:10.1109/TAFFC.2025.3537991

Alaa Nfissi;Wassim Bouachir;Nizar Bouguila;Brian Mishara

{"title":"SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition","authors":"Alaa Nfissi;Wassim Bouachir;Nizar Bouguila;Brian Mishara","doi":"10.1109/TAFFC.2025.3537991","DOIUrl":null,"url":null,"abstract":"In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 3","pages":"1839-1854"},"PeriodicalIF":9.8000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10886970/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

学习多分辨率信号小波网络用于语音情感识别

在人机交互和心理评估领域，语音情绪识别（SER）在从语音信号中破译情绪状态方面发挥着重要作用。尽管取得了进步，但由于系统复杂性、特征独特性问题和噪声干扰，挑战仍然存在。本文介绍了一个新的端到端（E2E）深度学习多分辨率框架，通过直接从原始波形语音信号中提取有意义的表示来解决这些限制。通过利用快速离散小波变换（FDWT）的特性，包括级联算法、共轭正交滤波器和系数去噪，我们的方法通过深度学习技术为小波基和去噪引入了一个可学习的模型。该框架包含一个激活函数，用于小波系数的可学习非对称硬阈值。我们的方法利用小波在时域和频域的有效定位能力。然后，我们将一维扩张卷积神经网络（1D expanded CNN）与空间注意层结合，将双向门控循环单元（Bi-GRU）与时间注意层结合，以有效捕捉情绪特征的细微时空特征。该模型在IEMOCAP和EMO-DB数据集上的表现优于当前最先进的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Transactions on Affective Computing COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, CYBERNETICS

CiteScore

15.00

自引率

6.20%

发文量

174

期刊介绍： The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.