Enhancing real-world far-field speech with supervised adversarial training

IF 3.4, Zone 2 (Physics and Astrophysics), Q1 ACOUSTICS, Applied Acoustics, Pub Date: 2024-11-15, DOI: 10.1016/j.apacoust.2024.110407

Tong Lei, Qinwen Hu, Zhongshu Hou, Jing Lu

Applied Acoustics, volume 229, Article 110407. Journal article, not open access. Source: https://www.sciencedirect.com/science/article/pii/S0003682X24005589

Citations: 0

Abstract


Generalizing speech enhancement models to real-world far-field speech faces significant challenges, including low signal-to-noise ratio, strong reverberation, and variable latency between far-field and near-field recordings. In addition, using non-ideal near-field recordings as the labeled desired output further reduces the effectiveness of commonly used predictive models. To tackle these challenges, we propose the Far-field to Near-field Speech Enhancement through Supervised Adversarial Training (FNSE-SAT) strategy. This approach applies supervised adversarial learning through a Multi-Resolution Discriminator, which exploits diverse speech characteristics at different frequency resolutions. A temporal frame shift operation is also incorporated to mitigate the alignment discrepancies observed in real-world data, and its effectiveness is confirmed by measuring the accuracy of Voice Activity Detection. Experiments in both causal and non-causal configurations show that FNSE-SAT significantly outperforms the state-of-the-art predictive model on real-world datasets. Furthermore, a transfer learning strategy, in which the model is initialized using a simulated dataset before fine-tuning on real-world data, strengthens the efficacy of FNSE-SAT and leads to superior outcomes. Character error rate results show that FNSE-SAT generates fewer components that deviate from the textual content than the generative diffusion method. Reducing the Discriminator to a single resolution decreases the DNSMOS but has only a slight effect on the character error rate.
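Two ideas in the abstract can be illustrated with a small sketch. The first function computes a character error rate in the standard way, as Levenshtein edit distance divided by reference length. The second is only one plausible reading of a "temporal frame shift" search: it tries a range of integer frame offsets between an estimate and a target and keeps the offset with the lowest mean squared error. The abstract does not describe the authors' actual procedure, so `best_frame_shift`, its `max_shift` parameter, and the use of scalar per-frame features are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance between the strings / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def best_frame_shift(estimate: Sequence[float],
                     target: Sequence[float],
                     max_shift: int) -> Tuple[int, float]:
    """Hypothetical alignment search (not the authors' method).

    Compares estimate[t] with target[t + shift] for every shift in
    [-max_shift, max_shift] and returns the (shift, mse) pair with the
    lowest mean squared error over the overlapping frames.
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(estimate[t], target[t + shift])
                 for t in range(len(estimate))
                 if 0 <= t + shift < len(target)]
        if not pairs:
            continue  # no overlap at this shift
        mse = sum((e - r) ** 2 for e, r in pairs) / len(pairs)
        if best is None or mse < best[1]:
            best = (shift, mse)
    if best is None:
        raise ValueError("estimate and target never overlap")
    return best
```

In practice the frames would be spectral vectors rather than scalars, and the search would more likely sit inside a training loss than run as a standalone step; the scalar version above only shows the offset search itself.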


Source journal

Applied Acoustics (Physics: Acoustics)

CiteScore: 7.40

Self-citation rate: 11.80%

Articles published: 618

Review time: 7.5 months

About the journal: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data.

Applied Acoustics encourages the exchange of practical experience in the following ways:
• Complete Papers
• Short Technical Notes
• Review Articles

It thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics, ranging from medicine and NDT to the environment and buildings, are welcome.