Self-attention Based Prosodic Boundary Prediction for Chinese Speech Synthesis
Authors: Chunhui Lu, Pengyuan Zhang, Yonghong Yan
Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7035-7039
Published: 2019-05-12
DOI: 10.1109/ICASSP.2019.8682770 (https://doi.org/10.1109/ICASSP.2019.8682770)
Citations: 24
Abstract
Predicting prosodic boundaries from input text plays an important role in Chinese text-to-speech (TTS) systems, because it directly influences the naturalness and intelligibility of synthesized speech. In this paper, we propose combining self-attention with multitask learning for prosodic boundary prediction. Self-attention captures the dependency between any two characters in the input sentence, while multitask learning models the relationship between prosodic boundaries and lexicon words by using word segmentation as an auxiliary task. The proposed method generates prosodic boundary labels directly from Chinese characters, making the whole process end-to-end. Experimental results show the effectiveness of the proposed model and indicate that performance can be further improved by pretraining the model with extra word segmentation data.
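The two ideas in the abstract, self-attention over characters and a shared encoder with an auxiliary word segmentation task, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head scaled dot-product attention, random untrained weights, a hypothetical four-way prosodic tag set (none, prosodic word, prosodic phrase, intonational phrase), BMES segmentation tags, and an illustrative `aux_weight` for the auxiliary loss. None of these details are given in the abstract.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention: every character
    # attends to every other character, regardless of distance.
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d)])
    return outputs, weights

def multitask_loss(H, W_pros, W_seg, pros_labels, seg_labels, aux_weight=0.5):
    # Shared representation H feeds two per-character classifiers.
    # aux_weight is a hypothetical weight on the segmentation loss.
    total = 0.0
    for h, y_pros, y_seg in zip(H, pros_labels, seg_labels):
        p_pros = softmax(matvec(W_pros, h))
        p_seg = softmax(matvec(W_seg, h))
        total += -math.log(p_pros[y_pros]) + aux_weight * -math.log(p_seg[y_seg])
    return total / len(H)

rng = random.Random(0)
dim = 8
sentence = "今天天气很好"  # illustrative input, segmented as 今天 / 天气 / 很 / 好

# Random character embeddings (untrained, for illustration only).
embed = {ch: [rng.uniform(-1, 1) for _ in range(dim)] for ch in dict.fromkeys(sentence)}
X = [embed[ch] for ch in sentence]

Wq, Wk, Wv = (rand_matrix(dim, dim, rng) for _ in range(3))
H, attn = self_attention(X, Wq, Wk, Wv)

# Hypothetical labels. Prosody: 0=none, 1=prosodic word, 2=prosodic phrase,
# 3=intonational phrase. Segmentation (BMES): 0=B, 1=M, 2=E, 3=S.
pros_labels = [0, 1, 0, 2, 0, 3]
seg_labels = [0, 2, 0, 2, 3, 3]

W_pros = rand_matrix(4, dim, rng)
W_seg = rand_matrix(4, dim, rng)
loss = multitask_loss(H, W_pros, W_seg, pros_labels, seg_labels)
print(f"joint loss on untrained weights: {loss:.4f}")
```

In a trained model the parameters would be learned by minimizing this joint loss. Pretraining on extra word segmentation data, as the abstract describes, would correspond to first optimizing only the segmentation term on a larger corpus.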