Real-Time Single Channel Speech Enhancement Using Triple Attention and Stacked Squeeze-TCN
Authors: Chaitanya Jannu, Manaswini Burra, Sunny Dayal Vanambathina, Veeraswamy Parisae
Journal: Computational Intelligence, Vol. 41, No. 1 (published 2025-01-06)
DOI: 10.1111/coin.70016 (https://onlinelibrary.wiley.com/doi/10.1111/coin.70016)
Impact Factor: 1.8 · JCR: Q3 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Speech enhancement is crucial in many speech processing applications. Recently, researchers have explored ways to improve performance by effectively capturing the long-term contextual relationships within speech signals. Multi-stage learning, in which several deep learning modules are activated one after another, has proven to be an effective approach. Attention mechanisms have also been explored for improving speech quality and have shown significant gains; such attention modules are designed to improve the performance of CNN backbone networks. However, these modules often rely on fully connected (FC) and convolution layers, which increase the model's parameter count and computational cost. The present study employs multi-stage learning within the framework of speech enhancement. The proposed model uses a multi-stage structure in which a sequence of squeeze temporal convolutional modules (STCMs) with doubling dilation rates follows a triple attention block (TAB) at each stage. An estimate is generated at each stage and refined in the subsequent one. To reintroduce the original information, a feature fusion module (FFM) is inserted at the beginning of each following stage. In the proposed model, the intermediate output undergoes several phases of step-by-step refinement through repeatedly unfolded STCMs, which ultimately yields a precise estimate of the spectrum. The TAB is crafted to enhance model performance, allowing the network to concentrate concurrently on regions of interest along the channel, spatial, and time-frequency dimensions. More specifically, the channel-spatial attention (CSA) block has two parallel branches combining channel attention with spatial attention, so that both the channel dimension and the spatial dimension are captured simultaneously. The signal is then emphasized as a function of time and frequency by aggregating the feature maps along these dimensions.
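The abstract does not give the exact formulation of the CSA block, but the idea of two parallel gating branches can be illustrated with a minimal numpy sketch. The pooling choices (global average pooling over the time-frequency axes for the channel branch, over the channel axis for the spatial branch) and the averaging fusion are assumptions for illustration, not the paper's definitions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    # x: (channels, time, freq). Pool over the time-frequency plane to get
    # one descriptor per channel, then gate each channel with a sigmoid weight.
    desc = x.mean(axis=(1, 2), keepdims=True)   # shape (C, 1, 1)
    return x * sigmoid(desc)

def spatial_attention(x):
    # Pool over the channel axis to get one descriptor per time-frequency
    # point, then gate each T-F position with a sigmoid weight.
    desc = x.mean(axis=0, keepdims=True)        # shape (1, T, F)
    return x * sigmoid(desc)

def csa_block(x):
    # Two parallel branches, fused here by simple averaging (the fusion
    # rule is a placeholder; the paper may combine the branches differently).
    return 0.5 * (channel_attention(x) + spatial_attention(x))

x = np.random.randn(4, 8, 16)   # (channels, time frames, frequency bins)
y = csa_block(x)
print(y.shape)                  # same (4, 8, 16) shape, element-wise reweighted
```

A learned version would replace the parameter-free pooling descriptors with small FC or convolutional layers, which is exactly the parameter cost the abstract says such attention modules incur.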
This time-frequency weighting improves the model's ability to capture the temporal dependencies of speech signals. Using the VCTK and LibriSpeech datasets, the proposed speech enhancement system is evaluated against state-of-the-art deep learning techniques and yields better results in terms of PESQ, STOI, CSIG, CBAK, and COVL.
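The "doubling dilation rates" in the stacked STCMs follow the standard temporal convolutional network schedule (1, 2, 4, 8, ...), which is what lets a shallow stack cover long-term context. A short sketch computes the receptive field of such a stack of causal dilated 1-D convolutions; the kernel size and layer count below are illustrative assumptions, not values from the paper:

```python
def tcn_receptive_field(kernel_size, num_layers):
    """Receptive field (in frames) of a stack of causal dilated 1-D
    convolutions whose dilation doubles at each layer: 1, 2, 4, ..."""
    rf = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        # Each layer extends the receptive field by (k - 1) * dilation frames.
        rf += (kernel_size - 1) * dilation
    return rf

# With kernel size 3 and 6 layers, the dilations are 1, 2, 4, 8, 16, 32,
# so the receptive field is 1 + 2 * (1 + 2 + 4 + 8 + 16 + 32) = 127 frames.
print(tcn_receptive_field(3, 6))  # → 127
```

The receptive field grows exponentially in depth while the parameter count grows only linearly, which is why repeatedly unfolding such modules is an efficient way to model long-range dependencies in speech.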
Journal Introduction:
This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.