Enhancing Prosodic Features by Adopting Pre-trained Language Model in Bahasa Indonesia Speech Synthesis
Authors: Lixuan Zhao, Jian Yang, Qinglai Qin
Published in: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence
Publication date: 2020-12-24
DOI: 10.1145/3446132.3446196 (https://doi.org/10.1145/3446132.3446196)
Citations: 3
Abstract
Deep neural network text-to-speech (TTS) systems can produce high-quality audio. However, modern TTS systems usually need a sizable corpus of studio-quality text-audio pairs as training input. Because Bahasa Indonesia has received little research attention, the available data are usually worse in terms of both quality and size. End-to-end (E2E) TTS systems trained on such corpora struggle to generate satisfactory speech; in particular, the prosodic features are not pronounced. We therefore propose a method to enhance the prosodic features of synthesized speech, based on the GST-Tacotron2 model and a pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers). Having learned from a large amount of unlabeled text data, BERT contains rich linguistic information, which can help TTS systems produce more pronounced prosodic features. The subjective evaluation of our experimental results shows that the proposed method can indeed enhance the rhythm of synthesized speech.