NUS-HLT System for Blizzard Challenge 2020
Yi Zhou, Xiaohai Tian, Xuehao Zhou, Mingyang Zhang, Grandee Lee, Rui Liu, Berrak Sisman, Haizhou Li
Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
Published: 2020-10-30 · DOI: 10.21437/vcc_bc.2020-7
Abstract
This paper presents the NUS-HLT text-to-speech (TTS) system for the Blizzard Challenge 2020. The challenge has two tasks: the Hub task 2020-MH1, synthesizing Mandarin Chinese from 9.5 hours of speech data recorded by a male native speaker of Mandarin, and the Spoke task 2020-SS1, synthesizing Shanghainese from 3 hours of speech data recorded by a female native speaker of Shanghainese. Our submitted system combines word embeddings extracted from a pre-trained language model with an end-to-end (E2E) TTS synthesizer to generate acoustic features from text input. A WaveRNN neural vocoder and a WaveNet neural vocoder are used to generate speech waveforms from acoustic features in the MH1 and SS1 tasks, respectively. Evaluation results provided by the challenge organizers demonstrate the effectiveness of our submitted TTS system.
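The abstract describes conditioning the E2E synthesizer on word embeddings from a pre-trained language model alongside the usual text-side input. A common way to realize this, and a minimal sketch of the idea (the dimensions, the alignment scheme mapping each phoneme to its parent word, and the function name are illustrative assumptions, not details from the paper), is to broadcast each word's embedding over its phonemes and concatenate it to the phoneme-level features before the synthesizer's encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration; not taken from the paper.
PHONE_DIM, WORD_DIM = 8, 4

def combine_embeddings(phone_emb, word_emb, word_ids):
    """Concatenate each phoneme-level feature with the embedding of the
    word it belongs to. `word_ids` maps phoneme index -> word index."""
    expanded = word_emb[word_ids]                      # (n_phones, WORD_DIM)
    return np.concatenate([phone_emb, expanded], axis=-1)

# Toy example: 5 phonemes spanning 2 words.
phone_emb = rng.normal(size=(5, PHONE_DIM))            # text-side features
word_emb = rng.normal(size=(2, WORD_DIM))              # e.g. from a pre-trained LM
word_ids = np.array([0, 0, 0, 1, 1])                   # phoneme-to-word alignment

combined = combine_embeddings(phone_emb, word_emb, word_ids)
print(combined.shape)  # (5, 12): phoneme features augmented with word context
```

The concatenated sequence would then feed the synthesizer's encoder in place of the plain phoneme embeddings, letting the acoustic model exploit word-level semantic context when predicting acoustic features.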