{"title":"Enriching Style Transfer in multi-scale control based personalized end-to-end speech synthesis","authors":"Zhongcai Lyu, Jie Zhu","doi":"10.1109/ICIST55546.2022.9926908","DOIUrl":null,"url":null,"abstract":"Personalized speech synthesis aims to transfer speech style with a few speech samples from the target speaker. However, pretrain and fine-tuning techniques are required to overcome the problem of poor performance for similarity and prosody in a data-limited condition. In this paper, a zero-shot style transfer framework based on multi-scale control is presented to handle the above problems. Firstly, speaker embedding is extracted from a single reference speech audio by a specially designed reference encoder, with which Speaker-Adaptive Linear Modulation (SALM) could generate the scale and bias vector to influence the encoder output, and consequently greatly enhance the adaptability to unseen speakers. Secondly, we propose a prosody module that includes a prosody extractor and prosody predictor, which can efficiently predict the prosody of the generated speech from the reference audio and text information and achieve phoneme-level prosody control, thus increasing the diversity of the synthesized speech. Using both objective and subjective metrics for evaluation, the experiments demonstrate that our model is capable of synthesizing speech of high naturalness and similarity of speech, with only a few or even a single piece of data from the target speaker.","PeriodicalId":211213,"journal":{"name":"2022 12th International Conference on Information Science and Technology (ICIST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 12th International Conference on Information Science and Technology (ICIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIST55546.2022.9926908","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Personalized speech synthesis aims to transfer speech style using only a few speech samples from the target speaker. Conventional approaches, however, rely on pre-training and fine-tuning to overcome the poor similarity and prosody that arise when data are limited. In this paper, a zero-shot style transfer framework based on multi-scale control is presented to address these problems. First, a speaker embedding is extracted from a single reference utterance by a specially designed reference encoder; from this embedding, Speaker-Adaptive Linear Modulation (SALM) generates scale and bias vectors that modulate the encoder output, greatly enhancing adaptability to unseen speakers. Second, we propose a prosody module comprising a prosody extractor and a prosody predictor, which efficiently predicts the prosody of the generated speech from the reference audio and the input text, enabling phoneme-level prosody control and increasing the diversity of the synthesized speech. Evaluated with both objective and subjective metrics, the experiments demonstrate that our model can synthesize speech with high naturalness and speaker similarity from only a few samples, or even a single sample, from the target speaker.
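
The abstract describes SALM as mapping a speaker embedding to scale and bias vectors that modulate the encoder output. Below is a minimal sketch of this FiLM-style affine conditioning; the module names, dimensions, and the exact placement after the text encoder are assumptions not stated in the abstract.

```python
# Sketch of SALM-style speaker conditioning (assumed FiLM-like design).
import torch
import torch.nn as nn

class SALM(nn.Module):
    def __init__(self, speaker_dim: int, hidden_dim: int):
        super().__init__()
        # Two linear heads predict a per-channel scale and bias
        # from the utterance-level speaker embedding.
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, encoder_out: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, hidden_dim) text/phoneme encoder states
        # spk_emb:     (batch, speaker_dim) from the reference encoder
        scale = self.to_scale(spk_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(spk_emb).unsqueeze(1)    # (batch, 1, hidden_dim)
        # Affine modulation, broadcast over the time axis.
        return scale * encoder_out + bias
```

Because the scale and bias are computed directly from a single reference utterance, an unseen speaker can be conditioned on at inference time without any fine-tuning, which is the zero-shot property the paper claims.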
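For the prosody module, a common extractor/predictor scheme matches the abstract's description: during training an extractor derives per-phoneme prosody targets from the reference mel-spectrogram, and a predictor regresses them from the text-side encoder states so no reference prosody is needed at inference. The sketch below shows such a predictor; the layer sizes and convolutional structure are illustrative assumptions, not the paper's confirmed architecture.

```python
# Sketch of a phoneme-level prosody predictor (assumed architecture).
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, hidden_dim: int, prosody_dim: int, kernel: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, prosody_dim)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, phonemes, hidden_dim) SALM-modulated states
        x = encoder_out.transpose(1, 2)                 # (batch, hidden, phonemes)
        x = torch.relu(self.conv1(x)).transpose(1, 2)   # back to (batch, phonemes, hidden)
        x = self.norm1(x).transpose(1, 2)
        x = torch.relu(self.conv2(x)).transpose(1, 2)
        x = self.norm2(x)
        # One prosody vector per phoneme, trained (e.g. with an L1/L2 loss)
        # against the prosody extractor's targets.
        return self.proj(x)                             # (batch, phonemes, prosody_dim)
```

Predicting a separate prosody vector per phoneme, rather than one utterance-level style vector, is what allows the phoneme-level prosody control and output diversity described in the abstract.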