{"title":"DeConformer-SENet:高效的可变形保形语音增强网络","authors":"Man Li, Ya Liu, Li Zhou","doi":"10.1016/j.dsp.2024.104787","DOIUrl":null,"url":null,"abstract":"<div><div>The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose, DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model, with modifications to both the self-attention and CNN components. Firstly, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across three dimensions, reducing computational complexity while maintaining performance. Additionally, we replace standard convolutions with deformable convolutions, aiming to expand the receptive field of the CNN and accurately model local features. We validate our proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics, while also being more computationally efficient. 
Furthermore, ablation studies confirm that DeConformer-SENet improvements enhance the performance of conventional Conformer and reduce model complexity without compromising the overall effectiveness.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"156 ","pages":"Article 104787"},"PeriodicalIF":2.9000,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeConformer-SENet: An efficient deformable conformer speech enhancement network\",\"authors\":\"Man Li, Ya Liu, Li Zhou\",\"doi\":\"10.1016/j.dsp.2024.104787\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose, DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model, with modifications to both the self-attention and CNN components. Firstly, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across three dimensions, reducing computational complexity while maintaining performance. Additionally, we replace standard convolutions with deformable convolutions, aiming to expand the receptive field of the CNN and accurately model local features. We validate our proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. 
Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics, while also being more computationally efficient. Furthermore, ablation studies confirm that DeConformer-SENet improvements enhance the performance of conventional Conformer and reduce model complexity without compromising the overall effectiveness.</div></div>\",\"PeriodicalId\":51011,\"journal\":{\"name\":\"Digital Signal Processing\",\"volume\":\"156 \",\"pages\":\"Article 104787\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1051200424004123\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200424004123","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
DeConformer-SENet: An efficient deformable conformer speech enhancement network
The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model with modifications to both the self-attention and CNN components. First, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across all three dimensions, reducing computational complexity while maintaining performance. Second, we replace standard convolutions with deformable convolutions, expanding the receptive field of the CNN so it can model local features more accurately. We validate the proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics while also being more computationally efficient. Furthermore, ablation studies confirm that the proposed modifications improve the performance of the conventional Conformer and reduce model complexity without compromising overall effectiveness.
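The abstract describes TFC-SA as compressing each dimension of a time-frequency-channel feature tensor into a one-dimensional vector and using its energy distribution as attention. A minimal NumPy sketch of that idea follows; the pooling choice, the softmax "energy" weighting, and all names are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of TFC-SA-style reweighting: squeeze each axis of a
# (time, frequency, channel) tensor to a 1-D descriptor, turn its softmax
# "energy distribution" into attention weights, and reweight the input.
# Cost is O(T*F*C), versus O((T*F*C)^2) for full self-attention over all bins.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tfc_self_attention(x):
    """x: (T, F, C) time-frequency-channel feature tensor."""
    T, F, C = x.shape
    # Compress each dimension to a 1-D vector by mean-pooling the other two.
    t_vec = x.mean(axis=(1, 2))   # shape (T,)
    f_vec = x.mean(axis=(0, 2))   # shape (F,)
    c_vec = x.mean(axis=(0, 1))   # shape (C,)
    # Softmax energies along each axis act as per-axis attention weights.
    t_w = softmax(t_vec)[:, None, None]   # broadcasts over F and C
    f_w = softmax(f_vec)[None, :, None]
    c_w = softmax(c_vec)[None, None, :]
    # Rescale by axis lengths so an all-uniform energy leaves x unchanged.
    return x * (T * t_w) * (F * f_w) * (C * c_w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 4))
out = tfc_self_attention(feat)
```

The per-axis weights are outer-product factors, so the module never materializes a (T·F·C) × (T·F·C) attention map, which is the complexity reduction the abstract claims.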
Journal overview:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Its objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy