Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari
arXiv - EE - Audio and Speech Processing, 2024-09-11
DOI: arxiv-2409.07265 (https://doi.org/arxiv-2409.07265)
Citations: 0
Abstract
We explore cross-dialect text-to-speech (CD-TTS), the task of synthesizing
learned speakers' voices in non-native dialects, especially in pitch-accent
languages. CD-TTS is important for developing voice agents that communicate
naturally with people across regions. We present a novel TTS model comprising
three sub-modules to perform competitively at this task. We first train a
backbone TTS model to synthesize dialect speech from text, conditioned on
phoneme-level accent latent variables (ALVs) extracted from speech by a
reference encoder. Then, we train an ALV predictor to predict ALVs tailored to
a target dialect from input text, leveraging our novel multi-dialect
phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the
effectiveness of our model by comparing it with a baseline derived from
conventional dialect TTS methods. The results show that our model improves the
dialectal naturalness of synthetic speech in CD-TTS.
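To make the described pipeline concrete, the following is a toy sketch (not the authors' implementation) of the two inference paths the abstract implies: at training time, phoneme-level ALVs are extracted from reference speech by a reference encoder; at CD-TTS inference time, they instead come from a text-based ALV predictor conditioned on the target dialect. All class names, dimensions, and the simple linear/tanh layers here are hypothetical stand-ins for the actual modules (e.g. the multi-dialect phoneme-level BERT).

```python
import numpy as np

rng = np.random.default_rng(0)

class ReferenceEncoder:
    """Extracts one accent latent variable (ALV) per phoneme from
    reference speech features. Hypothetical stand-in module."""
    def __init__(self, feat_dim, alv_dim):
        self.W = rng.normal(size=(feat_dim, alv_dim))

    def __call__(self, speech_feats):
        # speech_feats: (n_phonemes, feat_dim) -> (n_phonemes, alv_dim)
        return np.tanh(speech_feats @ self.W)

class ALVPredictor:
    """Predicts dialect-specific ALVs from phoneme embeddings; stands in
    for the multi-dialect phoneme-level BERT-based predictor."""
    def __init__(self, emb_dim, n_dialects, alv_dim):
        self.n_dialects = n_dialects
        self.W = rng.normal(size=(emb_dim + n_dialects, alv_dim))

    def __call__(self, phoneme_embs, dialect_id):
        # Condition each phoneme on a one-hot target-dialect code.
        one_hot = np.eye(self.n_dialects)[dialect_id]
        d = np.repeat(one_hot[None, :], len(phoneme_embs), axis=0)
        return np.tanh(np.concatenate([phoneme_embs, d], axis=1) @ self.W)

class BackboneTTS:
    """Maps phoneme embeddings plus ALVs to acoustic features."""
    def __init__(self, emb_dim, alv_dim, mel_dim):
        self.W = rng.normal(size=(emb_dim + alv_dim, mel_dim))

    def __call__(self, phoneme_embs, alvs):
        return np.concatenate([phoneme_embs, alvs], axis=1) @ self.W

# Toy dimensions (hypothetical, not from the paper).
EMB, FEAT, ALV, MEL, N_DIALECTS = 8, 12, 4, 20, 3
n_phonemes = 5
phoneme_embs = rng.normal(size=(n_phonemes, EMB))

ref_enc = ReferenceEncoder(FEAT, ALV)
predictor = ALVPredictor(EMB, N_DIALECTS, ALV)
tts = BackboneTTS(EMB, ALV, MEL)

# Training-time path: ALVs are extracted from reference speech.
ref_alvs = ref_enc(rng.normal(size=(n_phonemes, FEAT)))
mel_train = tts(phoneme_embs, ref_alvs)

# CD-TTS inference path: ALVs are predicted from text for a target dialect.
pred_alvs = predictor(phoneme_embs, dialect_id=1)
mel_infer = tts(phoneme_embs, pred_alvs)
```

The key design point the abstract highlights is that the backbone is trained against reference-derived ALVs, so swapping in predicted ALVs at inference lets the same backbone render a non-native dialect's accent pattern from text alone.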