FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis

2022 International Joint Conference on Neural Networks (IJCNN). Pub Date: 2022-07-18. DOI: 10.1109/IJCNN55064.2022.9892512

Xun Zhou, Zhiyang Zhou, Xiaodon Shi


Citations: 2

Abstract


Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS replaces the complex teacher model in FastSpeech with a simple yet effective attention-based soft alignment mechanism, which allows the model to adapt better to different languages. In addition to controlling voice speed and prosody, a fusion module is designed to better model speaker features in order to obtain the desired timbre. Several special loss functions are also applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves the fastest inference speed among all baseline models while also achieving the best speech quality. In addition, the controllability of the model with respect to prosody, voice speed and timbre was validated on several datasets, and its good performance on a low-resource Tibetan dataset demonstrates the universality of the model.
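The abstract does not give implementation details, so the following is only a minimal sketch of how voice-speed control is commonly realized in FastSpeech-style models: per-token durations are read off an attention alignment, and a length regulator repeats each token's hidden vector accordingly. The function names, the argmax-based duration extraction, and the `speed` parameter are all assumptions made for illustration and may differ from what FCH-TTS actually does.

```python
from typing import List, Sequence


def durations_from_attention(attn: Sequence[Sequence[float]], n_tokens: int) -> List[int]:
    """Count, for each input token, the mel frames that attend to it most strongly.

    attn[t][i] is the attention weight of mel frame t on input token i.
    Hypothetical helper: a hard argmax over a soft alignment.
    """
    durations = [0] * n_tokens
    for frame_weights in attn:
        best = max(range(n_tokens), key=lambda i: frame_weights[i])
        durations[best] += 1
    return durations


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int],
                    speed: float = 1.0) -> List[Sequence[float]]:
    """Expand token-level hidden vectors to frame level.

    Each vector is repeated round(duration / speed) times, so speed > 1
    yields fewer frames (faster speech) and speed < 1 yields more frames.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames: List[Sequence[float]] = []
    for vec, dur in zip(hidden, durations):
        repeats = int(dur / speed + 0.5)  # round half up
        frames.extend([vec] * repeats)
    return frames
```

With a four-frame alignment over two tokens, calling `durations_from_attention` and then `length_regulate` with `speed=2.0` would halve the number of output frames, while `speed=0.5` would double it. Prosody and timbre control, which the abstract also mentions, are not covered by this sketch.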
