{"title":"只选择最好的语音模仿者:Top-K多对多语音转换与StarGAN","authors":"Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo","doi":"10.1016/j.specom.2023.103022","DOIUrl":null,"url":null,"abstract":"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103022"},"PeriodicalIF":2.4000,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN\",\"authors\":\"Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo\",\"doi\":\"10.1016/j.specom.2023.103022\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"156 \",\"pages\":\"Article 103022\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001565\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001565","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN
Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.