Enhancing Prosodic Features by Adopting Pre-trained Language Model in Bahasa Indonesia Speech Synthesis

Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence. Pub Date: 2020-12-24. DOI: 10.1145/3446132.3446196

Lixuan Zhao, Jian Yang, Qinglai Qin


Citations: 3

Abstract

Deep neural network text-to-speech (TTS) systems can produce high-quality audio. However, modern TTS systems usually need a sizable number of studio-quality text-audio pairs as input. Because research on Bahasa Indonesia is limited, the available data are usually worse in terms of both quality and size. End-to-end (E2E) TTS systems trained on such corpora struggle to generate satisfactory speech; in particular, the prosodic features are not pronounced. We therefore propose a method to enhance the prosodic features of synthesized speech, based on the GST-Tacotron2 model and a pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers). BERT, trained on a large amount of unlabeled text data, contains rich linguistic information that can help TTS systems produce more pronounced prosodic features. Subjective evaluation of our experimental results shows that the proposed method does enhance the rhythm of synthesized speech.

