{"title":"The Ximalaya TTS System for Blizzard Challenge 2020","authors":"Wendi He, Zhiba Su, Yang Sun","doi":"10.21437/vcc_bc.2020-10","DOIUrl":null,"url":null,"abstract":"This paper describes the proposed Himalaya text-to-speech synthesis system built for the Blizzard Challenge 2020. The two tasks are to build expressive speech synthesizers based on the released 9.5-hour Mandarin corpus from a male native speaker and 3-hour Shanghainese corpus from a female native speaker respectively. Our architecture is Tacotron2-based acoustic model with WaveRNN vocoder. Several methods for preprocessing and checking the raw BC transcript are imple-mented. Firstly, the multi-task TTS front-end module trans-forms the text sequences into phoneme-level sequences with prosody label after implement the polyphonic disambiguation and prosody prediction module. Then, we train the released corpus on a Seq2seq multi-speaker acoustic model for Mel spec-trograms modeling. Besides, the neural vocoder WaveRNN[1] with minor improvements generate high-quality audio for the submitted results. The identifier for our system is M, and the experimental evaluation results in listening tests show that the system we submitted performed well in most of the criterion.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/vcc_bc.2020-10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper describes the Ximalaya text-to-speech synthesis system built for the Blizzard Challenge 2020. The two tasks are to build expressive speech synthesizers from the released corpora: a 9.5-hour Mandarin corpus from a male native speaker and a 3-hour Shanghainese corpus from a female native speaker. Our architecture is a Tacotron2-based acoustic model with a WaveRNN vocoder. Several methods for preprocessing and checking the raw Blizzard Challenge transcripts are implemented. First, the multi-task TTS front-end module transforms the text sequences into phoneme-level sequences with prosody labels, applying polyphone disambiguation and prosody prediction. Then, a seq2seq multi-speaker acoustic model is trained on the released corpora to predict mel spectrograms. Finally, a WaveRNN neural vocoder [1] with minor improvements generates high-quality audio for the submitted results. The identifier for our system is M, and the evaluation results of the listening tests show that the submitted system performed well on most of the criteria.
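The abstract describes a three-stage pipeline: front-end text processing (polyphone disambiguation and prosody prediction), a Tacotron2-style acoustic model that predicts mel spectrograms, and a WaveRNN vocoder that renders the waveform. The sketch below is a minimal, purely illustrative mock-up of that data flow; every class, function name, and parameter (frames per symbol, hop length, the "#3" prosody label) is a placeholder assumption, not the authors' implementation.

```python
# Illustrative sketch of the pipeline stages named in the abstract.
# All names and numbers are hypothetical; real components would be
# trained neural networks, not the toy stand-ins used here.

import numpy as np


def front_end(text: str) -> list[str]:
    """Toy stand-in for the multi-task front-end: text -> phoneme-level
    sequence with a prosody-break label appended."""
    # A real front-end would run polyphone disambiguation and prosody
    # prediction; here we just split characters and add one break marker.
    phones = list(text.replace(" ", ""))
    return phones + ["#3"]  # "#3" as an illustrative phrase-break label


def acoustic_model(phonemes: list[str], n_mels: int = 80) -> np.ndarray:
    """Toy stand-in for the seq2seq acoustic model: phonemes -> mel
    spectrogram, a few frames per input symbol."""
    frames_per_symbol = 5  # arbitrary placeholder duration
    t = len(phonemes) * frames_per_symbol
    rng = np.random.default_rng(0)
    return rng.standard_normal((t, n_mels)).astype(np.float32)


def wavernn_vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Toy stand-in for the WaveRNN vocoder: mel -> waveform, upsampled
    by the hop length (noise here, audio in a real vocoder)."""
    n_samples = mel.shape[0] * hop_length
    rng = np.random.default_rng(1)
    return rng.uniform(-1.0, 1.0, size=n_samples).astype(np.float32)


if __name__ == "__main__":
    phones = front_end("你好世界")
    mel = acoustic_model(phones)
    wav = wavernn_vocoder(mel)
    print(len(phones), mel.shape, wav.shape)
```

The point of the sketch is only the interface boundaries: the front-end output (symbols plus prosody labels) is the acoustic model's input, and the mel spectrogram is the only coupling between the acoustic model and the vocoder, which is what allows the vocoder to be improved independently as the abstract notes.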