PRVAE-VC: Non-Parallel Many-to-Many Voice Conversion with Perturbation-Resistant Variational Autoencoder
Kou Tanaka, H. Kameoka, Takuhiro Kaneko
12th ISCA Speech Synthesis Workshop (SSW2023), published August 26, 2023
DOI: 10.21437/ssw.2023-14
Abstract
This paper describes a novel approach to non-parallel many-to-many voice conversion (VC) that utilizes a variant of the conditional variational autoencoder (VAE) called a perturbation-resistant VAE (PRVAE). In VAE-based VC, it is commonly assumed that the encoder extracts content from the input speech while removing source speaker information. Following this extraction, the decoder generates output from the extracted content and target speaker information. However, in practice, the encoded features may still retain source speaker information, which can degrade speech quality in speaker conversion tasks. To address this issue, we propose a perturbation-resistant encoder trained to match the encoded features of the input speech with those of a pseudo-speech generated through a content-preserving transformation of the input speech's fundamental frequency and spectral envelope, using a combination of pure signal processing techniques. Our experimental results demonstrate that this straightforward constraint significantly enhances performance in non-parallel many-to-many speaker conversion tasks. Audio samples can be accessed at our webpage.¹
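The core constraint described above can be illustrated with a minimal sketch. Note that `toy_encoder`, `perturb_f0_envelope`, and the mean-squared matching loss below are illustrative placeholders chosen for this sketch, not the authors' actual network or signal-processing pipeline; the point is only the structure of the objective, i.e., that the encoder is penalized when the encodings of the original speech and its content-preserved pseudo-speech diverge.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_f0_envelope(frames, rng):
    # Content-preserving perturbation (illustrative stand-in): randomly
    # rescale a pseudo log-F0 track (column 0) and warp the remaining
    # spectral-envelope coefficients by a random per-utterance factor,
    # mimicking a speaker-identity change that leaves content intact.
    out = frames.copy()
    out[:, 0] *= rng.uniform(0.7, 1.4)   # pseudo F0 shift
    out[:, 1:] *= rng.uniform(0.8, 1.2)  # pseudo envelope warp
    return out

def toy_encoder(frames, W):
    # Stand-in for the VAE encoder mean: a per-frame nonlinear map.
    return np.tanh(frames @ W)

def perturbation_matching_loss(frames, W, rng):
    # Match encodings of the original speech and its pseudo-speech,
    # encouraging the encoder to discard the perturbed (speaker-
    # dependent) factors and keep only content.
    z = toy_encoder(frames, W)
    z_tilde = toy_encoder(perturb_f0_envelope(frames, rng), W)
    return float(np.mean((z - z_tilde) ** 2))

frames = rng.normal(size=(100, 16))   # fake frame-level features
W = rng.normal(size=(16, 8)) * 0.1    # fake encoder weights
loss = perturbation_matching_loss(frames, W, rng)
```

In an actual training loop, a term like `loss` would be minimized alongside the usual VAE reconstruction and KL objectives, pushing the encoder toward perturbation resistance.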