Enhancing real-world far-field speech with supervised adversarial training
Authors: Tong Lei, Qinwen Hu, Zhongshu Hou, Jing Lu
Journal: Applied Acoustics, Volume 229, Article 110407
Publication date: 2024-11-15
DOI: 10.1016/j.apacoust.2024.110407
URL: https://www.sciencedirect.com/science/article/pii/S0003682X24005589
Generalizing speech enhancement models to real-world far-field speech is challenging because of low signal-to-noise ratio, high reverberation, and variable latency between far-field and near-field recordings. In addition, using non-ideal near-field recordings as the labeled target further reduces the effectiveness of commonly used predictive models. To tackle these challenges, we propose the Far-field to Near-field Speech Enhancement through Supervised Adversarial Training (FNSE-SAT) strategy. This approach applies supervised adversarial learning through a Multi-Resolution Discriminator, which exploits diverse speech characteristics at different frequency resolutions. A temporal frame shift operation is also incorporated to mitigate the alignment discrepancies observed in real-world data, and its effectiveness is confirmed by measuring Voice Activity Detection accuracy. Experiments in both causal and non-causal configurations show that FNSE-SAT significantly outperforms the state-of-the-art predictive model on real-world datasets. Furthermore, a transfer learning strategy, in which the model is first trained on a simulated dataset and then fine-tuned on real-world data, further improves the performance of FNSE-SAT. Character error rate results show that FNSE-SAT generates fewer components that deviate from the textual content than the generative diffusion method. Reducing the discriminator to a single resolution decreases DNSMOS but has only a slight effect on the character error rate.
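The abstract reports character error rate (CER) without defining it. As a minimal sketch, assuming the standard definition (character-level Levenshtein distance divided by the number of reference characters) rather than the authors' actual evaluation script, CER could be computed as shown here. The function names and the choice to strip spaces before scoring are illustrative assumptions, not details from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters.

    Assumption: spaces are removed before scoring, which is a common
    convention but is not specified in the abstract.
    """
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(character_error_rate("far field speech", "far feld speech"))
```

A lower CER means the recognized transcript of the enhanced speech stays closer to the reference text, which is the sense in which the abstract compares FNSE-SAT with the generative diffusion method.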
About the journal:
Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics publishes complete papers, short technical notes, and review articles.
Manuscripts addressing any field of application of acoustics, from medicine and non-destructive testing (NDT) to the environment and buildings, are welcome.