Discourse-Level Prosody Modeling with a Variational Autoencoder for Non-Autoregressive Expressive Speech Synthesis

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2022-05-23. DOI: 10.1109/icassp43922.2022.9746238

Ning Wu, Zhaoci Liu, Zhenhua Ling

Citations: 1

Abstract

To address the one-to-many mapping from phoneme sequences to acoustic features in expressive speech synthesis, this paper proposes a method of discourse-level prosody modeling with a variational autoencoder (VAE), based on the non-autoregressive architecture of FastSpeech. In this method, phone-level prosody codes are extracted from prosody features by combining a VAE with FastSpeech, and are predicted from discourse-level text features together with BERT embeddings. The continuous wavelet transform (CWT) that FastSpeech2 uses for F0 representation is no longer necessary. Experimental results on a Chinese audiobook dataset show that the proposed method effectively exploits discourse-level linguistic information and outperforms FastSpeech2 in the naturalness and expressiveness of synthetic speech.
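The abstract does not give architectural details, so the following is only a minimal sketch of the generic VAE machinery that a phone-level prosody code would typically rely on: a reparameterized sample from a diagonal Gaussian posterior and its KL divergence to a prior. The function names, the 2-dimensional code size, the diagonal Gaussian posterior, and the standard-normal prior are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def sample_prosody_code(mu, log_var, rng):
    """Draw z = mu + sigma * eps for one phone (reparameterization trick).

    mu and log_var are the (hypothetical) posterior mean and log-variance
    that an encoder would output for this phone.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one phone.

    Assumes a standard-normal prior; the paper may use a different one.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


# Illustrative posteriors for two phones with 2-dimensional codes.
rng = random.Random(0)
posteriors = [([0.0, 0.0], [0.0, 0.0]), ([1.0, -0.5], [-1.0, -2.0])]
codes = [sample_prosody_code(mu, lv, rng) for mu, lv in posteriors]
kl_total = sum(kl_to_standard_normal(mu, lv) for mu, lv in posteriors)
```

In a setup like the one the abstract describes, codes of this kind would be extracted from prosody features at training time, while a separate predictor driven by discourse-level text features and BERT embeddings would supply them at synthesis time; that predictor is not sketched here.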
