W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision

IF 2.4 · CAS Tier 3, Computer Science · Journal on Audio Speech and Music Processing · Pub Date: 2023-10-28 · DOI: 10.1186/s13636-023-00312-8
Hao Huang, Lin Wang, Jichen Yang, Ying Hu, Liang He
Citations: 0

Abstract

Non-parallel-data voice conversion (VC) has achieved considerable breakthroughs in recent years thanks to the use of self-supervised pre-trained representations (SSPR). Features extracted by a pre-trained model are expected to carry rich content information. However, common SSPR-based VC includes no dedicated mechanism for removing speaker information during content-representation extraction, which prevents speaker information from being further removed from the SSPR representation. Moreover, conventional VC often selects the Mel-spectrogram as the reconstructed acoustic feature, which is inconsistent with the input of the content encoder and results in some information loss. Motivated by the above, we propose W2VC to address these issues. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text context at the phoneme level, and a content encoder with a gradient reversal layer (GRL)-based speaker classifier is used to remove speaker information during content-representation extraction; (3) a WLMR-based HiFi-GAN is trained to convert WLMR to waveform speech. VC experiments show that GRL purifies the content information of the self-supervised model well, and that GRL purification and CTC supervision on the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the WLMR-retrained vocoder achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It achieves 8.901 in objective MCD, 4.45 in speech naturalness, and 3.62 in speaker similarity on subjective MOS, outperforming the baseline.
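The gradient reversal layer mentioned in part (2) acts as an identity in the forward pass but flips the sign of the gradient in the backward pass, so that training the speaker classifier pushes the content encoder to *discard* speaker cues. A minimal framework-free sketch of that sign flip (function names and the scaling factor `lam` are illustrative, not from the paper):

```python
# Sketch of a gradient reversal layer (GRL): the forward pass is the
# identity; the backward pass multiplies incoming gradients by -lam.
# Minimizing the speaker classifier's loss through a GRL therefore
# *maximizes* speaker confusion in the content encoder's representation.

def grl_forward(x):
    """Identity in the forward direction."""
    return x

def grl_backward(grad_output, lam=1.0):
    """Reverse (and scale) the gradients flowing back to the encoder."""
    return [-lam * g for g in grad_output]

# Toy check: gradients from the speaker classifier arrive at the GRL,
# and the encoder receives them with flipped sign.
grads_from_classifier = [0.5, -0.2, 1.0]
grads_to_encoder = grl_backward(grads_from_classifier, lam=1.0)
print(grads_to_encoder)  # [-0.5, 0.2, -1.0]
```

In an autograd framework this is typically implemented as a custom backward function wrapped around the classifier's input; the scalar `lam` trades off how strongly the adversarial signal shapes the encoder.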
Source journal

Journal on Audio Speech and Music Processing (Engineering: Electrical and Electronic Engineering)

CiteScore: 4.10 · Self-citation rate: 4.20% · Annual articles: 28
About the journal: The aim of the EURASIP Journal on Audio, Speech, and Music Processing is to bring together researchers, scientists, and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. It is an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processing.
Latest articles in this journal

- A survey of technologies for automatic Dysarthric speech recognition
- Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling
- Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios
- W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision
- YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation