Multi-MelGAN Voice Conversion for the Creation of Under-Resourced Child Speech Synthesis

2022 IST-Africa Conference (IST-Africa). Pub Date: 2022-05-16. DOI: 10.23919/IST-Africa56635.2022.9845637

Avashna Govender, D. Paul

Citations: 0

Abstract

Voice conversion (VC) is an important technique for developing text-to-speech voices when speech resources are scarce. VC converts an audio signal from a source speaker to a specific target speaker while preserving the linguistic information. Because VC requires only a small amount of target data, it makes it possible to build high-quality text-to-speech voices from a limited amount of speech data. In this work, we implement VC using a mel-spectrogram Generative Adversarial Network called MelGAN-VC. This technique does not require parallel data and has been shown to work with as little as 1 hour of target speech data. The aim of this work was to build child voices by modifying the original one-to-one MelGAN-VC model into a many-to-many model and to determine whether such a model offers any gain. We found that the many-to-many model outperforms the baseline one-to-one model in terms of speaker similarity and naturalness of the output speech when using only 24 minutes of speech data.


