{"title":"DeConformer-SENet:高效的可变形保形语音增强网络","authors":"Man Li, Ya Liu, Li Zhou","doi":"10.1016/j.dsp.2024.104787","DOIUrl":null,"url":null,"abstract":"<div><div>The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose, DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model, with modifications to both the self-attention and CNN components. Firstly, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across three dimensions, reducing computational complexity while maintaining performance. Additionally, we replace standard convolutions with deformable convolutions, aiming to expand the receptive field of the CNN and accurately model local features. We validate our proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics, while also being more computationally efficient. 
Furthermore, ablation studies confirm that DeConformer-SENet improvements enhance the performance of conventional Conformer and reduce model complexity without compromising the overall effectiveness.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"156 ","pages":"Article 104787"},"PeriodicalIF":2.9000,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeConformer-SENet: An efficient deformable conformer speech enhancement network\",\"authors\":\"Man Li, Ya Liu, Li Zhou\",\"doi\":\"10.1016/j.dsp.2024.104787\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose, DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model, with modifications to both the self-attention and CNN components. Firstly, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across three dimensions, reducing computational complexity while maintaining performance. Additionally, we replace standard convolutions with deformable convolutions, aiming to expand the receptive field of the CNN and accurately model local features. We validate our proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. 
Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics, while also being more computationally efficient. Furthermore, ablation studies confirm that DeConformer-SENet improvements enhance the performance of conventional Conformer and reduce model complexity without compromising the overall effectiveness.</div></div>\",\"PeriodicalId\":51011,\"journal\":{\"name\":\"Digital Signal Processing\",\"volume\":\"156 \",\"pages\":\"Article 104787\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1051200424004123\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200424004123","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
DeConformer-SENet: An efficient deformable conformer speech enhancement network
The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model with modifications to both the self-attention and CNN components. First, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses information from each dimension of the input features into a one-dimensional vector. By calculating the energy distribution, this module models long-range relationships across all three dimensions, reducing computational complexity while maintaining performance. Second, we replace standard convolutions with deformable convolutions, expanding the receptive field of the CNN so it can model local features more accurately. We validate the proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ metrics while also being more computationally efficient. Furthermore, ablation studies confirm that the proposed modifications improve the performance of the conventional Conformer and reduce model complexity without compromising overall effectiveness.
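The abstract describes TFC-SA as compressing each dimension of a time-frequency-channel feature tensor into a one-dimensional vector and using its energy distribution as attention. A minimal NumPy sketch of that idea follows; the pooling choice, the softmax "energy" weighting, and all names are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of TFC-SA-style reweighting: squeeze each axis of a
# (time, frequency, channel) tensor to a 1-D descriptor, turn its softmax
# "energy distribution" into attention weights, and reweight the input.
# Cost is O(T*F*C), versus O((T*F*C)^2) for full self-attention over all bins.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tfc_self_attention(x):
    """x: (T, F, C) time-frequency-channel feature tensor."""
    T, F, C = x.shape
    # Compress each dimension to a 1-D vector by mean-pooling the other two.
    t_vec = x.mean(axis=(1, 2))   # shape (T,)
    f_vec = x.mean(axis=(0, 2))   # shape (F,)
    c_vec = x.mean(axis=(0, 1))   # shape (C,)
    # Softmax energies along each axis act as per-axis attention weights.
    t_w = softmax(t_vec)[:, None, None]   # broadcasts over F and C
    f_w = softmax(f_vec)[None, :, None]
    c_w = softmax(c_vec)[None, None, :]
    # Rescale by axis lengths so an all-uniform energy leaves x unchanged.
    return x * (T * t_w) * (F * f_w) * (C * c_w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 4))
out = tfc_self_attention(feat)
```

The per-axis weights are outer-product factors, so the module never materializes a (T·F·C) × (T·F·C) attention map, which is the complexity reduction the abstract claims.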
Journal overview:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Its objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy