Title: Compact deep neural networks for real-time speech enhancement on resource-limited devices
Authors: Fazal E Wahab, Zhongfu Ye, Nasir Saleem, Rizwan Ullah
DOI: 10.1016/j.specom.2023.103008
Journal: Speech Communication, Volume 156, Article 103008 (IF 2.4, JCR Q2, Acoustics; CAS Region 3, Computer Science)
Publication date: 2023-11-19 (Journal Article)
Article page: https://www.sciencedirect.com/science/article/pii/S0167639323001425
PDF: https://www.sciencedirect.com/science/article/pii/S0167639323001425/pdfft?md5=4520d0cf6044eb3b8b478a4f2214ba07&pid=1-s2.0-S0167639323001425-main.pdf
Citations: 0
Abstract
In real-time applications, the aim of speech enhancement (SE) is to achieve optimal performance while ensuring computational efficiency and near-instant outputs. Many deep neural models achieve strong performance in terms of speech quality and intelligibility; however, formulating efficient and compact deep neural models for real-time processing on resource-limited devices remains a challenge. This study presents a compact neural model for speech enhancement, designed in the complex frequency domain and optimized for resource-limited devices. The proposed model combines convolutional encoder–decoder and recurrent architectures to learn complex mappings from noisy speech, enabling low-latency causal processing for real-time enhancement. Recurrent architectures such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Simple Recurrent Unit (SRU) are incorporated as bottlenecks to capture temporal dependencies and improve SE performance. By representing speech in the complex frequency domain, the proposed model processes both magnitude and phase information. Further, this study extends the proposed models with attention-gate-based skip connections, enabling them to focus on relevant information and dynamically weight important features. The results show that the proposed models outperform recent benchmark models, obtaining better speech quality and intelligibility at a lower computational load. This study uses the WSJ0 database, where clean WSJ0 sentences are mixed with different background noises to create noisy mixtures. The results show that STOI and PESQ improve by 21.1% and 1.25 (41.5%) on the WSJ0 database, whereas on the VoiceBank+DEMAND database, STOI and PESQ improve by 4.1% and 1.24 (38.6%), respectively. The extended models show further improvement in STOI and PESQ under both seen and unseen noisy conditions.
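The complex-domain formulation means the network operates on complex STFT coefficients, so a predicted mask can adjust both the magnitude and the phase of each frequency bin. As a minimal, hypothetical sketch of this idea (the naive DFT helpers, the toy frame, and the identity mask below are illustrative, not taken from the paper):

```python
# Hypothetical sketch: applying a complex ratio mask to one STFT frame.
# A model like the one described would predict the mask; here it is
# hand-picked (identity) purely to show the mechanics.
import cmath

def dft(frame):
    """Naive DFT of a real-valued frame; returns a complex spectrum."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    """Inverse DFT; returns the real part of the reconstructed frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

noisy_frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # toy 8-sample frame
spectrum = dft(noisy_frame)

# A complex mask scales magnitude AND rotates phase per bin; a
# magnitude-only mask would leave the noisy phase untouched.
mask = [1 + 0j] * len(spectrum)  # identity mask for this demonstration
enhanced = [s * m for s, m in zip(spectrum, mask)]
out = idft(enhanced)
```

Scaling each bin by a complex value is what allows joint magnitude and phase correction, which is the advantage the abstract attributes to the complex frequency domain representation.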
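The attention-gated skip connections re-weight encoder features before they are merged into the decoder path. A toy, hand-weighted sketch of the gating idea (the `attention_gate` function, its inputs, and the fixed sigmoid form are assumptions for illustration; real attention gates use learned projections):

```python
# Hypothetical sketch of an attention gate on a skip connection:
# each skip feature is scaled by a sigmoid of (skip + gating signal),
# so irrelevant features are suppressed before reaching the decoder.
import math

def attention_gate(skip, gating):
    """Weight skip features by a sigmoid attention coefficient.
    Learned projection matrices are omitted for brevity."""
    return [s * (1 / (1 + math.exp(-(s + g))))
            for s, g in zip(skip, gating)]

# A strongly negative feature is attenuated; a positive one passes
# mostly through.
gated = attention_gate([1.0, -5.0], [0.0, 0.0])
```

The sigmoid coefficient plays the role of the dynamic feature weighting the abstract describes: features the gate deems relevant keep most of their magnitude, while others are driven toward zero.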
Journal overview:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.