Multi-MelGAN Voice Conversion for the Creation of Under-Resourced Child Speech Synthesis

2022 IST-Africa Conference (IST-Africa). Pub Date: 2022-05-16. DOI: 10.23919/IST-Africa56635.2022.9845637

Avashna Govender, D. Paul


Citations: 0

Abstract

Voice conversion (VC) is an important technique for developing text-to-speech voices when speech resources are scarce. VC converts an audio signal from a source speaker so that it sounds like a specific target speaker while preserving the linguistic content. Because VC requires only a small amount of target data, it makes it possible to build high-quality text-to-speech voices from a limited amount of speech. In this work, we implement VC using a mel-spectrogram Generative Adversarial Network called MelGAN-VC. This technique does not require parallel data and has been shown to succeed with as little as 1 hour of target speech data. The aim of this work was to build child voices by extending the original one-to-one MelGAN-VC model to a many-to-many model and to determine whether such a model offers any gain. We found that the many-to-many model outperforms the baseline one-to-one model in speaker similarity and in the naturalness of the output speech when using only 24 minutes of speech data.
