Enhancing real-world far-field speech with supervised adversarial training

IF 3.4 | Zone 2 (Physics and Astrophysics) | JCR Q1 (Acoustics) | Applied Acoustics | Pub Date: 2024-11-15 | DOI: 10.1016/j.apacoust.2024.110407
Tong Lei, Qinwen Hu, Zhongshu Hou, Jing Lu
Applied Acoustics, Volume 229, Article 110407. Journal Article. https://www.sciencedirect.com/science/article/pii/S0003682X24005589
Citations: 0

Abstract

Generalizing speech enhancement models to real-world far-field speech is difficult because of low signal-to-noise ratio, high reverberation, and variable latency between far-field and near-field recordings. Using non-ideal near-field recordings as the labeled target further reduces the effectiveness of commonly used predictive models. To tackle these challenges, we propose the Far-field to Near-field Speech Enhancement through Supervised Adversarial Training (FNSE-SAT) strategy. The approach applies supervised adversarial learning through a Multi-Resolution Discriminator, which exploits diverse speech characteristics at different frequency resolutions. A temporal frame shift operation is also incorporated to mitigate the alignment discrepancies observed in real-world data, and its effectiveness is confirmed by measuring the accuracy of Voice Activity Detection. Experiments in both causal and non-causal configurations show that FNSE-SAT significantly outperforms the state-of-the-art predictive model on real-world datasets. Furthermore, a transfer learning strategy, in which the model is initialized using a simulated dataset before being fine-tuned on real-world data, further strengthens FNSE-SAT. Character error rate results show that FNSE-SAT generates fewer components that deviate from the textual content than the generative diffusion method. Reducing the Discriminator to a single resolution lowers the DNSMOS score but has only a slight effect on the character error rate.
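
The abstract does not say how the temporal frame shift operation is implemented. As a rough illustration only, the sketch below assumes one plausible reading: try a small range of frame offsets between the enhanced output and the near-field target, and keep the offset with the lowest error. The names `mse_overlap` and `best_frame_shift`, the mean-squared-error criterion, and the list-of-frames data layout are all assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Time-major features: frames[t][f] is feature f of frame t (assumed layout).
Frames = Sequence[Sequence[float]]


def mse_overlap(est: Frames, ref: Frames, shift: int) -> float:
    """Mean squared error between est[t] and ref[t + shift] over frames where both exist."""
    total, count = 0.0, 0
    for t, est_frame in enumerate(est):
        u = t + shift
        if 0 <= u < len(ref):
            for a, b in zip(est_frame, ref[u]):
                total += (a - b) ** 2
                count += 1
    # A shift that leaves no overlapping frames can never be selected.
    return total / count if count else float("inf")


def best_frame_shift(est: Frames, ref: Frames, max_shift: int) -> Tuple[int, float]:
    """Search shifts in [-max_shift, max_shift] and return (shift, loss) with the lowest loss.

    Ties are broken in favour of the smallest absolute shift.
    """
    candidates = [
        (mse_overlap(est, ref, s), abs(s), s)
        for s in range(-max_shift, max_shift + 1)
    ]
    loss, _, shift = min(candidates)
    return shift, loss


# Hypothetical toy data: the estimate is the reference delayed by two frames.
ref = [[float(t), float((t * t) % 7)] for t in range(10)]
est = [[0.0, 0.0], [0.0, 0.0]] + ref[:8]
shift, loss = best_frame_shift(est, ref, max_shift=3)
```

In this toy case the search returns a shift of -2, meaning each estimated frame lines up with the reference frame two steps earlier. In a training loop under the same assumption, the loss at the selected shift would be the one that is back-propagated, so that a small recording latency is not penalized as an enhancement error.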
Source journal
Applied Acoustics (Physics: Acoustics)
CiteScore: 7.40
Self-citation rate: 11.80%
Articles published: 618
Review time: 7.5 months
Journal description: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.
Latest articles from this journal
• Application of a priori knowledge-enhanced fuzzy clustering to acoustic emission-based damage identification of composite laminates
• A wideband damage source localization method using enhanced virtual time reversal mirror technique and modal analysis with sparse acoustic emission array
• Enhancing real-world far-field speech with supervised adversarial training
• Estimation of lung sound cycle span using spectro-temporal respiratory frequency evaluation
• Speech emotion recognition using multi resolution Hilbert transform based spectral and entropy features