FCH-TTS: Fast, Controllable and High-quality Non-Autoregressive Text-to-Speech Synthesis

Xun Zhou, Zhiyang Zhou, Xiaodon Shi
DOI: 10.1109/IJCNN55064.2022.9892512 (https://doi.org/10.1109/IJCNN55064.2022.9892512)
Venue: 2022 International Joint Conference on Neural Networks (IJCNN)
Publication date: 2022-07-18
Citations: 2

Abstract

Inspired by the success of the non-autoregressive speech synthesis model FastSpeech, we propose FCH-TTS, a fast, controllable and universal neural text-to-speech (TTS) model capable of generating high-quality spectrograms. The basic architecture of FCH-TTS is similar to that of FastSpeech, but FCH-TTS replaces the complex teacher model in FastSpeech with a simple yet effective attention-based soft alignment mechanism, which allows the model to adapt better to different languages. Beyond control of voice speed and prosody, the model includes a fusion module designed to better model speaker features and so obtain the desired timbre. Several special loss functions are also applied to ensure the quality of the output mel-spectrogram. Experimental results on the LJSpeech dataset show that FCH-TTS achieves faster inference than all baseline models while also achieving the best speech quality. In addition, the controllability of the model with respect to prosody, voice speed and timbre was validated on several datasets, and its good performance on the low-resource Tibetan dataset demonstrates the universality of the model.
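
The abstract names two mechanisms without detailing them: an attention-based alignment between text and spectrogram frames, and control of voice speed. The sketch below illustrates one common way these fit together in FastSpeech-style models: per-token durations are read off an attention matrix, scaled by a speed factor, and used to expand the token sequence to frame length. It is an illustration under stated assumptions, not the FCH-TTS implementation; the function names, the hard argmax reading of a soft alignment, and the toy attention values are all hypothetical.

```python
from typing import List, Sequence


def durations_from_attention(attn: Sequence[Sequence[float]]) -> List[int]:
    """Count, for each input token, the output frames that attend to it most.

    attn[t][i] is the attention weight of output frame t on input token i.
    Taking the argmax per frame is an assumption made for this sketch.
    """
    num_tokens = len(attn[0])
    durations = [0] * num_tokens
    for frame_weights in attn:
        best_token = max(range(num_tokens), key=lambda i: frame_weights[i])
        durations[best_token] += 1
    return durations


def scale_durations(durations: Sequence[int], speed: float = 1.0) -> List[int]:
    """Shorten (speed > 1) or lengthen (speed < 1) every token's duration."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Round half up; durations are non-negative, so int() truncation is safe.
    return [int(d / speed + 0.5) for d in durations]


def length_regulate(tokens: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each token (standing in for its hidden state) durations[i] times."""
    expanded: List[str] = []
    for token, d in zip(tokens, durations):
        expanded.extend([token] * d)
    return expanded


# Toy alignment (made-up values): 5 output frames over 3 input tokens.
attn = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.0, 0.2, 0.8],
]
durations = durations_from_attention(attn)     # [2, 1, 2]
slow = scale_durations(durations, speed=0.5)   # half speed: [4, 2, 4]
frames = length_regulate(["a", "b", "c"], slow)
```

In this toy case, halving the speed doubles each duration, so the expanded sequence has 10 frames instead of 5. A real system would expand encoder hidden vectors rather than string labels.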