A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling

Interspeech, pp. 3003-3007. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10879

T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei


Citations: 5



Abstract

Text-to-speech and voice conversion are two common speech generation tasks that are typically solved with different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion that relies on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model copies a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind. It also shows that the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. In addition, adapting our model to a new speaker takes as little as 3 minutes of GPU time and only 15 seconds of untranscribed audio, which makes it attractive for practical applications.
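
The layout described in the abstract (two input-specific encoders feeding one shared decoder) can be illustrated with a small structural sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `text_encoder`, `speech_encoder`, `shared_decoder`, `clone_voice` and `convert_voice`, the dimension `DIM`, and the toy denoising rule are all hypothetical. In particular, the decoder is a hand-written loop that moves Gaussian noise toward a fixed target, standing in for a trained diffusion model.

```python
import random
from typing import List

Vector = List[float]
DIM = 4  # hypothetical size of the shared intermediate representation


def text_encoder(text: str) -> List[Vector]:
    """Toy stand-in for a text encoder: one DIM-sized vector per character."""
    return [[(ord(ch) % 7) / 7.0] * DIM for ch in text]


def speech_encoder(frames: List[Vector]) -> List[Vector]:
    """Toy stand-in for a speech encoder: maps each source frame into the
    same DIM-sized space that the text encoder uses."""
    return [[sum(frame) / len(frame)] * DIM for frame in frames]


def shared_decoder(condition: List[Vector], speaker: Vector,
                   steps: int = 10, seed: int = 0) -> List[Vector]:
    """Toy iterative denoiser shared by both modes (NOT a trained model).

    Starts from Gaussian noise and, at every step, closes a fraction of the
    gap to a target built from the encoder output plus a speaker offset.
    """
    rng = random.Random(seed)
    target = [[c + s for c, s in zip(frame, speaker)] for frame in condition]
    x = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in condition]
    for step in range(steps):
        remaining = steps - step  # the last step closes the whole gap
        x = [[xi + (ti - xi) / remaining for xi, ti in zip(xf, tf)]
             for xf, tf in zip(x, target)]
    return x


def clone_voice(text: str, speaker: Vector) -> List[Vector]:
    """Voice cloning path: text encoder, then the shared decoder."""
    return shared_decoder(text_encoder(text), speaker)


def convert_voice(frames: List[Vector], speaker: Vector) -> List[Vector]:
    """Voice conversion path: speech encoder, then the same shared decoder."""
    return shared_decoder(speech_encoder(frames), speaker)


if __name__ == "__main__":
    target_speaker = [0.1] * DIM  # hypothetical speaker embedding
    print(len(clone_voice("hi", target_speaker)))
    print(len(convert_voice([[0.2, 0.4], [0.6, 0.8]], target_speaker)))
```

The point of the sketch is only the routing: both modes end in the same decoder call, so in this toy setup any change to the speaker vector affects cloning and conversion alike.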

