Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN

IF 2.4; Region 3, Computer Science; JCR Q2, ACOUSTICS; Speech Communication; Pub Date: 2023-11-30; DOI: 10.1016/j.specom.2023.103022

Claudio Fernandez-Martín, Adrian Colomer, Claudio Panariello, Valery Naranjo

Speech Communication, volume 156, Article 103022. Full text: https://www.sciencedirect.com/science/article/pii/S0167639323001565

Citations: 0

Abstract

Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best $K$ generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of $K$ during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
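The abstract describes Top-K training only in words: the generator is updated using the K generated samples that best "fool" the discriminator, and K is reduced as training proceeds. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes the discriminator outputs a probability in (0, 1) that a sample is real and that the generator uses a non-saturating loss; the function names `top_k_generator_loss` and `k_schedule`, and the parameters `start_epoch`, `decay`, and `min_fraction`, are hypothetical and do not come from the paper.

```python
import math


def top_k_generator_loss(d_scores, k):
    """Mean generator loss over the k generated samples the discriminator rates highest.

    d_scores: discriminator outputs for one batch of generated samples,
              assumed to be probabilities in (0, 1) of being real.
    k:        number of best-scoring samples to keep (1 <= k <= batch size).
    """
    if not 1 <= k <= len(d_scores):
        raise ValueError("k must be between 1 and the batch size")
    # Keep only the k samples that fool the discriminator best.
    best = sorted(d_scores, reverse=True)[:k]
    # Non-saturating generator loss (an assumption): -log D(G(x)), averaged over kept samples.
    return -sum(math.log(s) for s in best) / k


def k_schedule(epoch, batch_size, start_epoch=0, decay=0.99, min_fraction=0.5):
    """Hypothetical schedule: use the full batch until start_epoch,
    then shrink k geometrically each epoch down to a floor."""
    if epoch < start_epoch:
        return batch_size
    k = int(batch_size * decay ** (epoch - start_epoch))
    floor = max(1, int(batch_size * min_fraction))
    return max(k, floor)


# Usage: one batch of four generated samples, keeping the best two.
scores = [0.9, 0.1, 0.5, 0.7]
loss_top2 = top_k_generator_loss(scores, 2)            # uses 0.9 and 0.7 only
loss_full = top_k_generator_loss(scores, len(scores))  # ordinary full-batch loss
```

Setting `k` equal to the batch size recovers ordinary training, so `start_epoch` controls when Top-K selection begins, and `decay` together with `min_fraction` controls how aggressively poorly rated samples are discarded. Those are the two design questions the abstract says the study explores; the values used here are placeholders.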

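Source journal

Speech Communication (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 6.80
Self-citation rate: 6.20%
Publication count: not listed
Review time: 19.2 weeks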

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.