Shuffle Attention U-Net for Speech Enhancement in Time Domain
Authors: Chaitanya Jannu, S. Vanambathina
Journal: International Journal of Image and Graphics (impact factor 0.8; JCR Q4, Computer Science, Software Engineering)
Publication date: 2023-03-31
Publication type: Journal Article
DOI: 10.1142/s0219467824500438 (https://doi.org/10.1142/s0219467824500438)
Citations: 1
Abstract
Over the past 10 years, deep learning has enabled significant advances in the enhancement of noisy speech. In end-to-end speech enhancement, a deep neural network transforms a noisy speech signal into a clean speech signal directly in the time domain, without any conversion or mask estimation. Recently, U-Net-based models have achieved good enhancement performance. Despite this, some of them rely on ordinary convolutions and may therefore neglect context-related information and detailed features of the input speech. To address these issues, recent studies have improved model performance by adding network modules such as attention mechanisms and long short-term memory (LSTM). In this work, we propose a new U-Net-based speech enhancement model that uses a novel lightweight and efficient Shuffle Attention (SA) mechanism, a Gated Recurrent Unit (GRU), and residual blocks with dilated convolutions. Each residual block is followed by a multi-scale convolution block (MSCB). The proposed hybrid structure enables temporal context aggregation in the time domain. The advantage of the shuffle attention mechanism is that channel and spatial attention are carried out simultaneously for each sub-feature, which suppresses potential noise while highlighting the proper semantic feature areas by combining the same features from all locations. The MSCB is employed to extract rich temporal features. To represent the correlation between neighboring noisy speech frames, a two-layer GRU is added in the bottleneck of the U-Net. The experimental findings demonstrate that the proposed model outperforms other existing models in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
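The abstract credits dilated convolutions with aggregating temporal context. The sketch below shows, in plain Python, how a 1-D dilated convolution is computed and how stacking layers with growing dilation widens the receptive field. It is a minimal illustration, not the authors' implementation: the kernel size of 3, the dilation rates 1, 2, 4, 8, and the function names `dilated_conv1d` and `receptive_field` are all assumptions chosen for the example, since the abstract does not state the actual hyperparameters.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation, as used in deep learning).

    Each output sample combines inputs spaced `dilation` samples apart.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical values, chosen only for illustration.
signal = list(range(10))
out = dilated_conv1d(signal, [1, 0, -1], dilation=2)  # taps at t and t + 4

rf_dilated = receptive_field(3, [1, 2, 4, 8])  # exponentially growing dilation
rf_plain = receptive_field(3, [1, 1, 1, 1])    # ordinary convolutions
print(out, rf_dilated, rf_plain)
```

Under these assumed settings, four dilated layers cover 31 input samples while four ordinary layers cover only 9, at the same parameter count. This is the usual argument for using dilation to capture longer temporal context.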