Speech Synthesis Method Based on Tacotron2
Yang Li, Donghong Qin, Jinbo Zhang
2021 13th International Conference on Advanced Computational Intelligence (ICACI), 2021-05-14
DOI: 10.1109/ICACI52617.2021.9435882
Citations: 2
Abstract
Compared with traditional speech synthesis systems, end-to-end speech synthesis systems based on deep learning (such as DeepVoice3 and Tacotron2) not only reduce the need for linguistic knowledge but also produce speech whose quality is close to that of human pronunciation. However, such systems still suffer from skipped words, repeated pronunciation, and slow synthesis speed. To address the local information preference of the Tacotron2 decoder, this paper proposes maximizing the mutual information between the text and the predicted acoustic features, and using the WaveGlow synthesizer as the vocoder, in order to reduce the local information preference, the pronunciation errors, and the slow synthesis speed of the Tacotron2 model. Experimental results show that the improved model achieves a subjective MOS (Mean Opinion Score) of 3.94 and that synthesis speed is significantly improved.
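The quantity the abstract proposes to maximize between the text and the predicted acoustic features is read here as mutual information, which is an assumption about the intended term. The toy sketch below is not the paper's training objective; it only illustrates why that quantity is a useful signal. The function `mutual_information` and both 2x2 joint tables are hypothetical and chosen for illustration: one models a decoder whose output is independent of the text (the extreme case of local information preference), the other a decoder whose output is fully determined by the text.

```python
import math


def mutual_information(joint):
    """Return I(X;Y) in nats for a joint probability table joint[x][y]."""
    # Marginals: row sums give p(x), column sums give p(y).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


# Hypothetical joint distributions over (text token, acoustic class).
# Output independent of the text: the decoder ignores its input.
ignores_text = [[0.25, 0.25],
                [0.25, 0.25]]
# Output fully determined by the text.
follows_text = [[0.5, 0.0],
                [0.0, 0.5]]

mi_ignoring = mutual_information(ignores_text)
mi_following = mutual_information(follows_text)
```

In this toy setting the independent table yields zero mutual information and the deterministic table yields log 2 nats, so pushing the value up rewards outputs that actually depend on the text. A real model would need a tractable estimate or bound over continuous acoustic features, which the abstract does not specify.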