Stacked U-Net with Time–Frequency Attention and Deep Connection Net for Single Channel Speech Enhancement

International Journal of Image and Graphics (IF 0.8, Q4, Computer Science, Software Engineering) · Pub Date: 2024-04-09 · DOI: 10.1142/s0219467825500676
Veeraswamy Parisae, S. Nagakishore Bhavanam
{"title":"Stacked U-Net with Time–Frequency Attention and Deep Connection Net for Single Channel Speech Enhancement","authors":"Veeraswamy Parisae, S. Nagakishore Bhavanam","doi":"10.1142/s0219467825500676","DOIUrl":null,"url":null,"abstract":"Deep neural networks have significantly promoted the progress of speech enhancement technology. However, a great number of speech enhancement approaches are unable to fully utilize context information from various scales, hindering performance enhancement. To tackle this issue, we introduce a method called TFADCSU-Net (Stacked U-Net with Time-Frequency Attention (TFA) and Deep Connection Layer (DCL)) for enhancing noisy speech in the time–frequency domain. TFADCSU-Net adopts an encoder-decoder structure with skip links. Within TFADCSU-Net, a multiscale feature extraction layer (MSFEL) is proposed to effectively capture contextual data from various scales. This allows us to leverage both global and local speech features to enhance the reconstruction of speech signals. Moreover, we incorporate deep connection layer and TFA mechanisms into the network to further improve feature extraction and aggregate utterance level context. The deep connection layer effectively captures rich and precise features by establishing direct connections starting from the initial layer to all subsequent layers, rather than relying on connections from earlier layers to subsequent layers. This approach not only enhances the information flow within the network but also avoids a significant rise in computational complexity as the number of network layers increases. The TFA module consists of two attention branches operating concurrently: one directed towards the temporal dimension and the other towards the frequency dimension. These branches generate distinct forms of attention — one for identifying relevant time frames and another for selecting frequency wise channels. These attention mechanisms assist the models in discerning “where” and “what” to prioritize. Subsequently, the TA and FA branches are combined to produce a comprehensive attention map in two dimensions. This map assigns specific attention weights to individual spectral components in the time–frequency representation, enabling the networks to proficiently capture the speech characteristics in the T-F representation. The results confirm that the proposed method outperforms other models in terms of objective speech quality as well as intelligibility.","PeriodicalId":44688,"journal":{"name":"International Journal of Image and Graphics","volume":null,"pages":null},"PeriodicalIF":0.8000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Image and Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s0219467825500676","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

Abstract

Deep neural networks have significantly advanced speech enhancement technology. However, many speech enhancement approaches are unable to fully exploit contextual information across multiple scales, which limits further performance gains. To tackle this issue, we introduce TFADCSU-Net (Stacked U-Net with Time–Frequency Attention (TFA) and Deep Connection Layer (DCL)), a method for enhancing noisy speech in the time–frequency domain. TFADCSU-Net adopts an encoder-decoder structure with skip connections. Within TFADCSU-Net, a multiscale feature extraction layer (MSFEL) is proposed to effectively capture contextual information at multiple scales, allowing the network to leverage both global and local speech features when reconstructing the speech signal. Moreover, we incorporate a deep connection layer and a TFA mechanism into the network to further improve feature extraction and aggregate utterance-level context. The deep connection layer captures rich and precise features by establishing direct connections from the initial layer to all subsequent layers, rather than relying on connections from earlier layers to subsequent layers. This not only improves information flow within the network but also avoids a significant rise in computational complexity as the number of layers increases. The TFA module consists of two attention branches operating in parallel: one over the temporal dimension and the other over the frequency dimension. These branches generate distinct forms of attention: one identifies relevant time frames, and the other selects frequency-wise channels. These attention mechanisms help the model discern “where” and “what” to prioritize. The time-attention (TA) and frequency-attention (FA) branches are then combined to produce a comprehensive two-dimensional attention map that assigns a specific attention weight to each spectral component of the time–frequency representation, enabling the network to capture speech characteristics in the T-F representation effectively. The results confirm that the proposed method outperforms other models in terms of both objective speech quality and intelligibility.
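Since no reference implementation accompanies this abstract, the following PyTorch sketch only illustrates the connection pattern the deep connection layer describes: forwarding the initial layer's features directly to every subsequent block, rather than densely connecting all pairs of layers. The class name `DeepConnectionNet`, the block structure, kernel sizes, and depth are assumptions made for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DeepConnectionNet(nn.Module):
    """Illustrative stack of conv blocks (not the paper's code) in which the
    FIRST block's output is concatenated onto the input of every later block,
    instead of DenseNet-style all-to-all connections. The per-block input
    width therefore stays constant regardless of depth."""

    def __init__(self, channels: int, depth: int = 4):
        super().__init__()
        self.first = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Each later block sees its predecessor's output plus the first block's output.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(depth - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = self.first(x)          # features from the initial layer
        out = base
        for block in self.blocks:
            # Direct connection from the initial layer to every subsequent block.
            out = block(torch.cat([out, base], dim=1))
        return out
```

Because each later block only concatenates its predecessor's output with the fixed set of first-layer features, the input width per block stays at twice the channel count however deep the stack gets, which is consistent with the abstract's claim that complexity does not rise sharply with the number of layers.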
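Similarly, here is a minimal sketch of a two-branch time–frequency attention block of the kind described above, assuming features laid out as (batch, channels, time, frequency). The pooling statistics, the 1-D convolutions, and the multiplicative combination of the two branches into a 2-D map are illustrative choices, not the paper's exact TFA design.

```python
import torch
import torch.nn as nn

class TimeFrequencyAttention(nn.Module):
    """Two parallel attention branches, one over time frames and one over
    frequency bins, combined into a single 2-D map that reweights every
    T-F unit of the input. Input/output shape: (batch, channels, time, freq)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Time-attention branch: score each time frame.
        self.time_branch = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )
        # Frequency-attention branch: score each frequency bin.
        self.freq_branch = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool over frequency to get a per-frame statistic: (b, c, t).
        time_stat = x.mean(dim=3)
        # Pool over time to get a per-bin statistic: (b, c, f).
        freq_stat = x.mean(dim=2)
        time_attn = torch.sigmoid(self.time_branch(time_stat))   # (b, c, t)
        freq_attn = torch.sigmoid(self.freq_branch(freq_stat))   # (b, c, f)
        # Combine the two branches into a full (b, c, t, f) attention map.
        attn_2d = time_attn.unsqueeze(3) * freq_attn.unsqueeze(2)
        return x * attn_2d

# Example: 100 time frames x 161 frequency bins, 64 feature channels.
tfa = TimeFrequencyAttention(channels=64)
out = tfa(torch.randn(2, 64, 100, 161))   # same shape, T-F units reweighted
```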