{"title":"The HITSZ TTS system for Blizzard challenge 2020","authors":"Huhao Fu, Yiben Zhang, Kai Liu, Chao Liu","doi":"10.21437/vcc_bc.2020-11","DOIUrl":null,"url":null,"abstract":"In this paper, we present the techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020. The corpus released to the participants this year is about 10-hours speech recordings from a Chinese male speaker with mixed Mandarin and English speech. Based on the above situation, we build an end to end speech synthesis system for this task. It is divided into the following parts: (1) the front-end module to analyze the pronunciation and prosody of text; (2) The phoneme-converted tool; (3) The forward-attention based sequence-to-sequence acoustic model with jointly learning with prosody labels to predict 80-dimensional Mel-spectrogram; (4) The Parallel WaveGAN based neural vocoder to reconstruct waveforms. This is the first time for us to join the Blizzard Challenge, and the identifier for our system is G. The evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance. The problems in the system are also discussed in this paper.","PeriodicalId":355114,"journal":{"name":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","volume":"125 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/vcc_bc.2020-11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In this paper, we present the techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020. The corpus released to the participants this year is about 10-hours speech recordings from a Chinese male speaker with mixed Mandarin and English speech. Based on the above situation, we build an end to end speech synthesis system for this task. It is divided into the following parts: (1) the front-end module to analyze the pronunciation and prosody of text; (2) The phoneme-converted tool; (3) The forward-attention based sequence-to-sequence acoustic model with jointly learning with prosody labels to predict 80-dimensional Mel-spectrogram; (4) The Parallel WaveGAN based neural vocoder to reconstruct waveforms. This is the first time for us to join the Blizzard Challenge, and the identifier for our system is G. The evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance. The problems in the system are also discussed in this paper.