Compact deep neural networks for real-time speech enhancement on resource-limited devices

IF 2.4 · JCR Q2 (ACOUSTICS) · CAS Tier 3 (Computer Science) · Speech Communication · Pub Date: 2023-11-19 · DOI: 10.1016/j.specom.2023.103008
Fazal E Wahab , Zhongfu Ye , Nasir Saleem , Rizwan Ullah
Speech Communication, vol. 156, Article 103008. Citations: 0.

Abstract

In real-time applications, the aim of speech enhancement (SE) is to achieve optimal performance while ensuring computational efficiency and near-instant outputs. Many deep neural models achieve strong speech quality and intelligibility; however, designing efficient, compact deep neural models for real-time processing on resource-limited devices remains a challenge. This study presents a compact neural model, designed in the complex frequency domain, for speech enhancement on resource-limited devices. The proposed model combines a convolutional encoder–decoder with recurrent architectures to effectively learn complex mappings from noisy speech, enabling low-latency causal processing. Recurrent architectures such as Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and the Simple Recurrent Unit (SRU) are incorporated as bottlenecks to capture temporal dependencies and improve SE performance. By representing speech in the complex frequency domain, the proposed model processes both magnitude and phase information. Further, this study extends the proposed models with attention-gate-based skip connections, enabling the models to focus on relevant information and dynamically weight important features. The results show that the proposed models outperform recent benchmark models, obtaining better speech quality and intelligibility at a lower computational load. This study uses the WSJ0 database, where clean WSJ0 sentences are mixed with different background noises to create noisy mixtures. STOI and PESQ improve by 21.1% and 1.25 (41.5%) on the WSJ0 database, whereas on the VoiceBank+DEMAND database STOI and PESQ improve by 4.1% and 1.24 (38.6%), respectively. The extended models show further improvement in STOI and PESQ under both seen and unseen noise conditions.
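To make the abstract's two key ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation: complex-domain enhancement applies a predicted complex mask to the STFT so that one complex multiply adjusts both magnitude and phase, and an attention-gated skip connection lets the decoder weight which encoder features pass through. The `stft` helper, the identity `mask`, and the weight matrices `W_e`/`W_d` are illustrative stand-ins for the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(x, n_fft=256, hop=128):
    """Naive one-sided STFT: frame the signal, Hann-window, FFT."""
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.fft.rfft(frames, axis=-1)  # (frames, bins), complex

# Complex-domain enhancement: a complex mask scales the magnitude and
# rotates the phase of every time-frequency bin in one multiply.
noisy = rng.standard_normal(2048)
spec = stft(noisy)                       # complex spectrogram, (15, 129)
mask = np.ones_like(spec)                # stand-in for the network output
enhanced = spec * mask                   # handles magnitude AND phase
assert np.allclose(np.abs(enhanced), np.abs(mask) * np.abs(spec))

# Attention-gated skip connection: the decoder feature decides how much
# of the matching encoder feature passes through the skip path.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

enc = rng.standard_normal((15, 64))      # encoder feature map (T, C)
dec = rng.standard_normal((15, 64))      # decoder feature at same scale
W_e = rng.standard_normal((64, 64)) * 0.1   # hypothetical gate weights
W_d = rng.standard_normal((64, 64)) * 0.1
gate = sigmoid(enc @ W_e + dec @ W_d)    # per-element weight in (0, 1)
skip = gate * enc                        # gated skip, same shape as enc
assert np.abs(skip).max() <= np.abs(enc).max()
```

Because the gate output lies in (0, 1), the skip path can only attenuate encoder features, which is how such gates suppress irrelevant activations rather than amplifying them.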

Source Journal
Speech Communication (Engineering & Technology — Computer Science: Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Annual articles: 94
Review time: 19.2 weeks
Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.