只选择最好的语音模仿者:Top-K多对多语音转换与StarGAN

IF 2.4 3区 计算机科学 Q2 ACOUSTICS Speech Communication Pub Date : 2023-11-30 DOI:10.1016/j.specom.2023.103022
Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo
{"title":"只选择最好的语音模仿者:Top-K多对多语音转换与StarGAN","authors":"Claudio Fernandez-Martín ,&nbsp;Adrian Colomer ,&nbsp;Claudio Panariello ,&nbsp;Valery Naranjo","doi":"10.1016/j.specom.2023.103022","DOIUrl":null,"url":null,"abstract":"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN\",\"authors\":\"Claudio Fernandez-Martín ,&nbsp;Adrian Colomer ,&nbsp;Claudio Panariello ,&nbsp;Valery Naranjo\",\"doi\":\"10.1016/j.specom.2023.103022\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001565\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001565","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

随着语音技术应用的增长,语音转换系统变得越来越重要。深度学习技术,特别是生成对抗网络(GANs),使合成媒体的创造取得了重大进展,包括语音合成领域。最近的一个例子是StarGAN-VC,它使用一对发生器和鉴别器来转换多个扬声器之间的声音。然而,gan的训练稳定性可能是一个问题。Top-K方法只使用最好的K个生成样本来训练生成器,这些样本可以“欺骗”鉴别器,该方法已应用于图像任务和简单的GAN架构。在这项工作中,我们证明了Top-K方法可以在最先进的语音转换系统(如StarGAN-VC)中提高转换语音的质量和稳定性。我们还探讨了实现Top-K方法的最佳时间以及如何在训练期间减少K的值。通过定量和定性研究,发现Top-K方法与常规或香草训练相比,收敛速度更快,转换质量更好。此外,人类听众认为使用Top-K生成的样本更自然,更有可能相信它们是由人类说话者产生的。本研究的结果表明,Top-K方法可以有效地提高基于深度学习的语音转换系统的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN

Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best K generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of K during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
期刊最新文献
A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments Forms, factors and functions of phonetic convergence: Editorial Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study Zero-shot voice conversion based on feature disentanglement Multi-modal co-learning for silent speech recognition based on ultrasound tongue images
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1