Multi scale encoder-decoder network with Time Frequency Attention and S-TCN for single channel speech enhancement

Journal of Intelligent & Fuzzy Systems, Pub Date: 2024-03-12, DOI: 10.3233/jifs-233312

Veeraswamy Parisae, S. Bhavanam


Citations: 0

Abstract


Multi scale encoder-decoder network with Time Frequency Attention and S-TCN for single channel speech enhancement

The goal of speech enhancement is to recover clean speech from recordings made in noisy environments. Acoustic scenarios with low signal-to-noise ratios (SNR) make it particularly challenging to separate the target speech from the noise. To enhance noisy speech, we propose a multi-scale convolutional encoder-decoder architecture based on feature recalibration, with a squeeze temporal convolutional network (S-TCN) as the bottleneck. Each multi-scale convolutional layer in the encoder and decoder is followed by a time-frequency attention (TFA) module. The recalibration-based multi-scale 2D convolution layers extract local and contextual information. The recalibration network is also equipped with a gating mechanism that controls the flow of information among the layers, weighting the scaled features so that noise is suppressed and speech is retained. The fully connected (FC) layer in the bottleneck of the encoder-decoder contains only a few neurons, which capture global information from the multi-scale 2D convolution layer and reduce the number of parameters. An S-TCN, inspired by the popular temporal convolutional neural network (TCNN), is inserted between the encoder and the decoder to model long-term dependencies in speech. The TFA is a highly efficient network component that applies two attentions in parallel, one over time frames and the other over frequency channels. Together, these attentions explicitly exploit positional information to build a two-dimensional attention map that captures the significant time-frequency distribution of speech. On the Common Voice dataset, the proposed model consistently improves on current benchmarks, as measured by two widely used objective metrics, PESQ and STOI. Compared with the unprocessed noisy speech, average PESQ and STOI scores increase by 45.7% and 23.8%, respectively, for seen background noises, and by 43.5% and 21.4% for unseen background noises. Experiments confirm that the proposed approach outperforms numerous state-of-the-art algorithms.
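The two-branch attention described in the abstract can be illustrated with a small NumPy sketch. The abstract does not specify the layers inside the TFA module, so everything here is an assumption made for illustration: the function name `tf_attention`, the average pooling, the channel-mixing matrices `w_time` and `w_freq` (stand-ins for whatever learned layers the paper uses), and the outer-product combination of the two branches.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tf_attention(x, w_time, w_freq):
    """Toy time-frequency attention (illustrative, not the paper's exact module).

    x:      feature map of shape (channels, frames, bins)
    w_time: (channels, channels) mixing matrix for the time branch (hypothetical)
    w_freq: (channels, channels) mixing matrix for the frequency branch (hypothetical)
    """
    # Time branch: pool over frequency bins, leaving one descriptor per frame.
    time_desc = x.mean(axis=2)                  # (C, T)
    time_att = sigmoid(w_time @ time_desc)      # (C, T), values in (0, 1)

    # Frequency branch: pool over frames, leaving one descriptor per bin.
    freq_desc = x.mean(axis=1)                  # (C, F)
    freq_att = sigmoid(w_freq @ freq_desc)      # (C, F), values in (0, 1)

    # Combine both 1D attentions into a 2D map via a per-channel outer product.
    att_map = time_att[:, :, None] * freq_att[:, None, :]   # (C, T, F)

    # Reweight every time-frequency unit of the input.
    return x * att_map, att_map


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 10, 16))     # 4 channels, 10 frames, 16 bins
w_time = rng.standard_normal((4, 4)) * 0.1
w_freq = rng.standard_normal((4, 4)) * 0.1

out, att_map = tf_attention(features, w_time, w_freq)
print(out.shape, att_map.shape)
```

Both returned arrays have the same shape as the input. Because each branch looks only at its own axis, the 2D map costs roughly one weight per frame plus one per bin for each channel, instead of one weight per time-frequency unit, which is one plausible reading of why the abstract calls the component efficient.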

