On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement

Impact factor: 5.1 · CAS Zone 2 (Computer Science) · JCR Q1 (Acoustics) · IEEE/ACM Transactions on Audio, Speech, and Language Processing · Pub Date: 2024-08-15 · DOI: 10.1109/TASLP.2024.3444492

Eike J. Nustede;Jörn Anemüller

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3838-3849, published 2024-08-15. Full text: https://ieeexplore.ieee.org/document/10637717/

Citation count: 0

Abstract

Audio de-noising systems deployed in real-world scenarios must generalize well to different environments. Single-channel signals in particular require efficient noise filtering that does not degrade speech intelligibility. Our previous work has shown that a probabilistic latent space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only and complex-valued U-Network models on two different datasets and in a train-test mismatch scenario. Model adaptability is evaluated with a newly introduced curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while the increase in network size is negligible. Improvements of up to 0.97 dB SI-SDR in matched and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability score aligns well with observed performance behaviour, further validating the probabilistic latent space model.
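The abstract reports its results in SI-SDR (scale-invariant signal-to-distortion ratio). As a rough illustration of what that metric measures, the sketch below computes SI-SDR with NumPy using the commonly cited definition: the estimate is projected onto the reference, and the energy of the projection is compared with the energy of the residual. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for this example; the paper's own evaluation code is not shown on this page and may differ.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Hypothetical helper for illustration; not taken from the paper.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Assumption: remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling factor: projection of the estimate onto the reference.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # part of the estimate explained by the reference
    residual = estimate - target    # everything else counts as distortion

    # Ratio of target energy to residual energy, in decibels.
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


# Usage: synthetic signals stand in for clean and enhanced speech.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = si_sdr(clean + 0.1 * noise, clean)    # lightly corrupted estimate
heavy = si_sdr(clean + 1.0 * noise, clean)   # heavily corrupted estimate
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged, which is why the metric is called scale-invariant.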

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

Latest articles from this journal

List of Reviewers

IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization

MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation

Online Neural Speaker Diarization With Target Speaker Tracking

Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach