Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

arXiv - CS - Sound | Pub Date: 2024-05-02 | DOI: arxiv-2405.01730

Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman

Citations: 0

Abstract

Expressive voice conversion (VC) converts speaker identity for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. A further major challenge of expressive VC lies in modeling emotional prosody. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
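
The conditioning scheme described in the abstract can be illustrated with a generic conditional DDPM training step. The sketch below is not the authors' implementation: the abstract does not specify the network, the noise schedule, or how the three conditions are fused, so the linear schedule, the frame-wise concatenation in `build_condition`, and the `predict_noise` callable are all assumptions introduced for illustration. It shows only the standard DDPM forward process and noise-prediction loss, with content units, an emotion embedding, and a speaker embedding passed in as the condition.

```python
import numpy as np

def make_noise_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative products of alpha = 1 - beta."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar = alpha_bars[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

def build_condition(content_units, emotion_emb, speaker_emb):
    """Hypothetical fusion: repeat the utterance-level emotion and speaker
    embeddings for every frame and concatenate them with the frame-level
    content features."""
    num_frames = content_units.shape[0]
    emo = np.tile(emotion_emb, (num_frames, 1))
    spk = np.tile(speaker_emb, (num_frames, 1))
    return np.concatenate([content_units, emo, spk], axis=1)

def denoising_loss(predict_noise, x0, cond, t, alpha_bars, rng):
    """Standard DDPM objective: mean squared error between the sampled noise
    and the noise predicted from (x_t, t, cond)."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, noise)
    pred = predict_noise(x_t, t, cond)
    return float(np.mean((pred - noise) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    _, alpha_bars = make_noise_schedule()
    waveform = rng.standard_normal(16000)       # stand-in for one second of target speech
    content = rng.standard_normal((50, 256))    # stand-in for frame-level speech-unit features
    emotion = rng.standard_normal(128)          # stand-in for an SER embedding
    speaker = rng.standard_normal(192)          # stand-in for a speaker-verification embedding
    cond = build_condition(content, emotion, speaker)
    # Placeholder denoiser that always predicts zero noise; a real model would be a neural network.
    dummy_model = lambda x_t, t, c: np.zeros_like(x_t)
    print(cond.shape, denoising_loss(dummy_model, waveform, cond, 500, alpha_bars, rng))
```

In an end-to-end setup of this kind, `x0` is the waveform itself rather than a mel-spectrogram, which is what removes the separate vocoder stage mentioned in the abstract. All array sizes above are placeholders.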
