Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

12th ISCA Speech Synthesis Workshop (SSW2023), Pub Date: 2022-10-18, DOI: 10.21437/ssw.2023-10

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, H. Saruwatari


Abstract

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis aims to produce speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we propose a method for synthesizing spontaneous speech that is more robust to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method uses pseudo-FPs sampled with an FP word prediction model in addition to ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
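The abstract does not spell out how pseudo-FPs are sampled, so the following is only a minimal sketch of the general idea of pseudo-FP insertion, under stated assumptions: a hypothetical FP word prediction model has already produced, for the slot before each word, a probability distribution over FP labels plus a "no FP" label. The function name `insert_pseudo_fps`, the `NO_FP` label, and the example FP words are illustrative and not taken from the paper.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical label meaning "insert nothing at this position".
NO_FP = "<none>"


def insert_pseudo_fps(
    words: Sequence[str],
    fp_probs: Sequence[Dict[str, float]],
    rng: random.Random,
) -> List[str]:
    """Sample one FP label per word slot and insert it before the word.

    fp_probs[i] maps each candidate label (FP words and NO_FP) to the
    probability assumed to come from an FP word prediction model for the
    slot directly before words[i].
    """
    if len(words) != len(fp_probs):
        raise ValueError("need exactly one distribution per word")

    output: List[str] = []
    for word, dist in zip(words, fp_probs):
        labels = list(dist.keys())
        weights = list(dist.values())
        # Draw a single label according to the predicted probabilities.
        sampled = rng.choices(labels, weights=weights, k=1)[0]
        if sampled != NO_FP:
            output.append(sampled)
        # The linguistic (non-FP) word is always kept, in its original order.
        output.append(word)
    return output


if __name__ == "__main__":
    rng = random.Random(0)
    words = ["so", "we", "tested", "it"]
    # Illustrative distributions only; a real model would predict these.
    probs = [
        {NO_FP: 0.5, "um": 0.3, "uh": 0.2},
        {NO_FP: 0.9, "uh": 0.1},
        {NO_FP: 0.6, "um": 0.4},
        {NO_FP: 1.0},
    ]
    print(insert_pseudo_fps(words, probs, rng))
```

Sampling from the predicted distribution, rather than always taking the most likely label, exposes the synthesizer to more varied FP placements than the ground-truth annotations alone, which is the kind of diversity the abstract refers to. How the sampled sequences are combined with the regularization term is not described in the abstract and is left out of this sketch.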


