CRCTTS: Convolution-Recurrent-Convolution Text-to-Speech System
Kuo Chen, Xuebin Sun
Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics, published 2022-06-24
https://doi.org/10.1145/3548608.3559304
End-to-end speech synthesis has already replaced Statistical Parametric Speech Synthesis (SPSS) in the text-to-speech (TTS) field. End-to-end models based on neural networks require little domain knowledge yet synthesize more natural speech. Tacotron was the first model to synthesize speech that even human listeners find hard to distinguish from natural speech. We propose a new end-to-end speech synthesis system called Convolution-Recurrent-Convolution Text-to-Speech (CRCTTS). We take Tacotron as our baseline and modify its architecture with fully convolutional neural network (CNN) modules and Dynamic Convolution Attention (DCA). We also introduce a guided attention mechanism to accelerate attention alignment in the decoder module. With these changes, the proposed model is shown to synthesize higher-quality speech than the baseline while taking less time in both the training and synthesis stages.
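
The abstract does not say how the attention guide is formulated. As a rough illustration only, the sketch below assumes one commonly used form of guided attention: a penalty matrix that is near zero along the diagonal of the text-to-frame alignment and grows away from it, so that training pushes attention toward a roughly monotonic alignment. The function names, the width parameter `g`, and its default of 0.2 are assumptions for this example and are not taken from the paper.

```python
import numpy as np


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)).

    Assumed formulation: entries are 0 on the normalized diagonal
    and approach 1 far away from it. `g` controls the band width.
    """
    n = np.arange(n_text)[:, None] / n_text      # text positions, shape (N, 1)
    t = np.arange(n_frames)[None, :] / n_frames  # frame positions, shape (1, T)
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))


def guided_attention_loss(attention, g=0.2):
    """Mean of the attention matrix weighted by the off-diagonal penalty.

    `attention` has shape (N, T): rows are text positions, columns are
    decoder frames. The result would be added to the main training loss.
    """
    n_text, n_frames = attention.shape
    weights = guided_attention_weights(n_text, n_frames, g)
    return float(np.mean(attention * weights))


if __name__ == "__main__":
    diagonal = np.eye(10)               # perfectly diagonal alignment
    uniform = np.full((10, 10), 0.1)    # attention spread evenly over the text
    print("diagonal:", guided_attention_loss(diagonal))
    print("uniform: ", guided_attention_loss(uniform))
```

Under this assumed formulation, a diagonal alignment incurs no penalty while a diffuse one does, which is the sense in which such a term can speed up alignment early in training.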