MFFR-net: Multi-scale feature fusion and attentive recalibration network for deep neural speech enhancement
Authors: Nasir Saleem, Sami Bourouis
Journal: Digital Signal Processing, Volume 156, Article 104870
DOI: 10.1016/j.dsp.2024.104870
Publication date: 2024-11-14 (Journal Article)
Impact Factor: 2.9; JCR: Q2 (Engineering, Electrical & Electronic)
Full text: https://www.sciencedirect.com/science/article/pii/S1051200424004949
Citations: 0
Abstract
Deep neural networks (DNNs) have been successfully applied in advancing speech enhancement (SE), particularly in overcoming the challenges posed by nonstationary noisy backgrounds. In this context, multi-scale feature fusion and recalibration (MFFR) can improve speech enhancement performance by combining multi-scale and recalibrated features. This paper proposes a speech enhancement system that capitalizes on a large-scale pre-trained model, seamlessly fused with features attentively recalibrated using varying kernel sizes in convolutional layers. This process enables the SE system to capture features across diverse scales, enhancing its overall performance. The proposed SE system uses a transferable features extractor architecture and integrates with multi-scaled attentively recalibrated features. Utilizing 2D-convolutional layers, the convolutional encoder-decoder extracts both local and contextual features from speech signals. To capture long-term temporal dependencies, a bidirectional simple recurrent unit (BSRU) serves as a bottleneck layer positioned between the encoder and decoder. The experiments are conducted on three publicly available datasets including Texas Instruments/Massachusetts Institute of Technology (TIMIT), LibriSpeech, and Voice Cloning Toolkit+Diverse Environments Multi-channel Acoustic Noise Database (VCTK+DEMAND). The experimental results show that the proposed SE system performs better than several recent approaches on the Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) evaluation metrics. On the TIMIT dataset, the proposed system showcases a considerable improvement in STOI (17.3%) and PESQ (0.74) over the noisy mixture. The evaluation on the LibriSpeech dataset yields results with a 17.6% and 0.87 improvement in STOI and PESQ.
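The abstract gives the idea of the MFFR block but no implementation details. As a rough illustration only (not the authors' actual network), the following minimal pure-Python sketch applies the core recipe to a 1-D signal: several convolutional branches with different kernel sizes, a squeeze step that summarizes each branch, a sigmoid gate that recalibrates each branch's contribution, and a summation that fuses them. All function names, kernel choices, and gate parameters here are hypothetical.

```python
import math

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution (correlation) of signal x with kernel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * (k - 1 - pad)
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(x))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mffr_block(x, kernels, gate_weights, gate_bias):
    """Multi-scale feature fusion with attentive recalibration (sketch).

    Each kernel size yields one feature branch; global average pooling
    summarizes each branch as a scalar, a sigmoid gate maps that scalar
    to a weight in (0, 1), and the recalibrated branches are summed.
    """
    branches = [conv1d(x, k) for k in kernels]
    # Squeeze: one scalar descriptor per branch (global average pooling).
    descriptors = [sum(b) / len(b) for b in branches]
    # Excite: per-branch gate in (0, 1) from a tiny linear layer.
    gates = [sigmoid(w * d + gate_bias) for w, d in zip(gate_weights, descriptors)]
    # Recalibrate each branch by its gate, then fuse by summation.
    fused = [sum(g * b[i] for g, b in zip(gates, branches)) for i in range(len(x))]
    return fused, gates

# Hypothetical usage: two branches, kernel sizes 1 and 3.
signal = [1.0, 2.0, 3.0, 4.0]
out, gates = mffr_block(signal,
                        kernels=[[1.0], [1/3, 1/3, 1/3]],
                        gate_weights=[0.1, 0.1],
                        gate_bias=0.0)
```

In the paper's 2-D setting the branches would be convolutional feature maps over time-frequency inputs rather than 1-D lists, but the fuse-then-recalibrate flow is the same idea the abstract describes.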
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy