GALD-SE: Guided Anisotropic Lightweight Diffusion for Efficient Speech Enhancement

IF 3.9 | Zone 2, Engineering & Technology | Q2, ENGINEERING, ELECTRICAL & ELECTRONIC | IEEE Signal Processing Letters | Pub Date: 2024-12-26 | DOI: 10.1109/LSP.2024.3522852

Chengzhong Wang;Jianjun Gu;Dingding Yao;Junfeng Li;Yonghong Yan

IEEE Signal Processing Letters, vol. 32, pp. 426-430. https://ieeexplore.ieee.org/document/10816305/

Citations: 0

Abstract

Speech enhancement aims to improve the intelligibility and quality of speech across diverse noise conditions. Recently, diffusion models have attracted considerable attention in speech enhancement, achieving competitive results. Current diffusion-based methods blur the signal distribution with isotropic Gaussian noise and recover the clean speech distribution from the prior. However, these methods often carry a substantial computational burden. We argue that this inefficiency partly stems from overlooking that speech enhancement is not purely a generative task; it primarily involves noise reduction and completion of missing information, while the clean clues in the original mixture do not need to be regenerated. In this paper, we propose a method that introduces noise with anisotropic guidance during the diffusion process, allowing the neural network to preserve clean clues within noisy recordings. This approach substantially reduces computational complexity while remaining robust to various forms of noise and speech distortion. Experiments demonstrate that the proposed method achieves state-of-the-art results with only approximately 4.5 million parameters, significantly fewer than other diffusion methods require. This effectively narrows the model-size gap between diffusion-based and predictive speech enhancement approaches. Additionally, the proposed method performs well in very noisy scenarios, demonstrating its potential for applications in highly challenging environments.
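
The contrast the abstract draws between isotropic noise and noise injected under anisotropic guidance can be illustrated with a toy sketch. The abstract gives no equations, so everything below is an assumption made for illustration rather than the authors' formulation: the variance-preserving mix, the function names, and in particular the hypothetical `guidance` mask, which here simply scales the noise level per time-frequency bin.

```python
import numpy as np

def isotropic_forward(x0, t, rng):
    """Variance-preserving mix with the same noise level t in every bin."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps

def guided_anisotropic_forward(x0, t, guidance, rng):
    """Hypothetical guided variant: the noise level of each bin is t * guidance.

    guidance holds values in [0, 1]; 0 leaves a bin untouched and
    1 reproduces the isotropic case for that bin.
    """
    eps = rng.standard_normal(x0.shape)
    level = t * guidance
    return np.sqrt(1.0 - level) * x0 + np.sqrt(level) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 6))   # toy "spectrogram": 4 frequency bins x 6 frames
guidance = np.ones_like(x0)
guidance[:2, :] = 0.0              # assume the two lowest bins are already clean

xt_iso = isotropic_forward(x0, 0.8, rng)
xt_aniso = guided_anisotropic_forward(x0, 0.8, guidance, rng)
```

In this sketch the bins marked as clean pass through the forward step unchanged, so a denoising network would only have to regenerate the remaining bins. How the guidance is actually obtained and applied in GALD-SE is described in the paper itself, not here.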

Source journal

IEEE Signal Processing Letters (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

7.40

Self-citation rate

12.80%

Articles published

339

Review time

2.8 months

Journal description: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.