Non-filter waveform generation from cepstrum using spectral phase reconstruction
Yasuhiro Hamada, Nobutaka Ono, S. Sagayama. DOI: 10.21437/SSW.2016-5

This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction as an alternative to the conventional source-filter model in text-to-speech (TTS) systems. Since the primary purpose of the filter is to produce a waveform with the desired spectral shape, one alternative to the source-filter framework is to convert the designed spectrum directly into a waveform using recently developed phase reconstruction from the power spectrogram. Given cepstral features and the fundamental frequency (F0) from a TTS system as the desired spectrum, the spectrum to be heard by the listener is obtained by converting the cepstral features into a linear-scale power spectrum and multiplying it by the pitch structure derived from F0. The signal waveform is then generated from the power spectrogram by spectral phase reconstruction. An advantage of the proposed method is that it is free from the undesired amplitude and long decay times often caused by sharp resonances in recursive filters. In preliminary experiments, we compared the temporal and gain characteristics of speech synthesized with the proposed method and with the mel-log spectrum approximation (MLSA) filter. The results show that the proposed method outperforms the MLSA filter in both characteristics, suggesting desirable properties for speech synthesis.
Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform
Zhaojie Luo, T. Takiguchi, Y. Ariki. DOI: 10.21437/SSW.2016-23

Artificial neural networks are among the most important models for learning features in voice conversion (VC) tasks. Neural networks (NNs) are effective at processing nonlinear features such as mel-cepstral coefficients (MCCs), which represent the spectral features. However, a simple representation of the fundamental frequency (F0) is not sufficient for neural networks to handle emotional voices, because the F0 time sequence of an emotional voice changes drastically. Therefore, in this paper we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that can be trained well by NNs for prosody modeling in emotional voice conversion. In addition, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that convert the spectral features. By combining these approaches, the proposed method can change the spectrum and the prosody of an emotional voice at the same time, and it outperformed other state-of-the-art methods for emotional voice conversion.
On the impact of phoneme alignment in DNN-based speech synthesis
Mei Li, Zhizheng Wu, Lei Xie. DOI: 10.21437/SSW.2016-32

Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). However, in current implementations, training a DNN-based speech synthesis system requires phonetic transcripts to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on DNN-based speech synthesis. Specifically, we compare the performance of DNN-based speech synthesis systems that use manual alignment and HMM-based forced alignment derived from three types of labels: mono-phone, tri-phone and full-context. Objective and subjective evaluations of the naturalness of the synthesized speech are conducted to compare the different alignments.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, J. Yamagishi. DOI: 10.21437/SSW.2016-24

Deep learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments show that this scheme produces much better results than a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches. An identifying code is injected into this layer together with the acoustic features of many speakers. Experiments show that the a-layer can effectively learn to interpolate acoustic features between speakers.
Wide Passband Design for Cosine-Modulated Filter Banks in Sinusoidal Speech Synthesis
Nobuyuki Nishizawa, Tomonori Yazaki. DOI: 10.21437/SSW.2016-29

A new filter design strategy to shorten the filter length is introduced for sinusoidal speech synthesis using cosine-modulated filter banks. Multiple sinusoidal waveforms for speech synthesis can be synthesized efficiently with pseudo-quadrature mirror filter (pseudo-QMF) banks, which are constructed by cosine modulation of the coefficients of a low-pass prototype filter: stable sinusoids are represented as sparse vectors in the subband domain of the pseudo-QMF banks, and the filter-bank computation can be performed efficiently with fast algorithms for the discrete cosine transform (DCT). However, pseudo-QMF banks require relatively long filters to reduce noise caused by aliasing. In this study, a wider passband design based on a perfect-reconstruction (PR) QMF bank is introduced. Experimentally designed filters indicate that the filter length can be reduced from 448 taps to 384 taps for 32-subband systems with errors below -96 dB, without a significant increase in the computational cost of speech synthesis.
Utterance Selection Techniques for TTS Systems Using Found Speech
P. Baljekar, A. Black. DOI: 10.21437/SSW.2016-30

The goal of this paper is to investigate data selection techniques for found speech. Found speech, unlike clean, phonetically balanced datasets recorded specifically for synthesis, contains a lot of noise that may not be labeled well, and it may contain utterances with varying channel conditions. These channel variations and other noise distortions can sometimes be useful, adding diverse data to the training set, but in other cases they are detrimental to the system. The approach outlined in this work investigates various metrics to detect noisy data that degrade the performance of the system on a held-out test set. We assume a seed set of 100 utterances, to which we incrementally add fixed sets of utterances, and determine which metrics can capture the misaligned and noisy data. We report results on three datasets: an artificially degraded set of clean speech, a single-speaker database of found speech, and a multi-speaker database of found speech. All of these experiments are carried out on male speakers; we also show that comparable results are obtained on a female multi-speaker corpus.
A hybrid harmonics-and-bursts modelling approach to speech synthesis
J. Beskow, Harald Berthelsen. DOI: 10.21437/SSW.2016-34

Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders work well for voiced speech because they offer independent control over the voice source (e.g. pitch) and the vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour, and here the benefits of the vocoder are less clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
Temporal modeling in neural network based statistical parametric speech synthesis
K. Tokuda, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku. DOI: 10.21437/SSW.2016-18

This paper proposes a novel neural network structure for speech synthesis, in which spectrum, F0 and duration parameters are simultaneously modeled in a unified framework. In conventional neural network approaches, spectrum and F0 parameters are predicted by neural networks while phone and/or state durations are given by other, external duration predictors. In order to consistently model not only spectrum and F0 parameters but also durations, we adopt a special type of mixture density network (MDN) structure, which models utterance-level probability density functions conditioned on the corresponding input feature sequence. This is achieved by modeling the conditional probability distribution of utterance-level output features, given input features, with a hidden semi-Markov model, where its parameters are generated by a neural network trained with a log-likelihood-based loss function. Variations of the proposed neural network structure are also discussed. Subjective listening test results show that the proposed approach improves the naturalness of synthesized speech.
Multidimensional scaling of systems in the Voice Conversion Challenge 2016
M. Wester, Zhizheng Wu, J. Yamagishi. DOI: 10.21437/SSW.2016-7

This study investigates how listeners judge the similarity of voice-converted voices using a talker discrimination task. The data come from the Voice Conversion Challenge 2016, in which 17 participants from around the world took part, building voice-converted voices from a shared data set of source and target speakers. This paper describes the evaluation of similarity for four of the source-target pairs (two intra-gender and two cross-gender) in more detail. Multidimensional scaling was performed to illustrate where each system was perceived to lie in an acoustic space relative to the source and target speakers and to the other systems.
Emphasis recreation for TTS using intonation atoms
Pierre-Edouard Honnet, Philip N. Garner. DOI: 10.21437/SSW.2016-3

We are interested in emphasis for text-to-speech synthesis. In speech-to-speech translation, emphasising the correct words is important to convey the underlying meaning of a message. In this paper, we propose to use a generalised command-response (CR) model of intonation to generate emphasis in synthetic speech. We first analyse the differences in the model parameters between emphasised words in an acted emphasis scenario and their neutral counterparts. We then investigate word-level intonation modelling, using a simple random forest as the base framework, to predict the model parameters specifically for emphasised words. Based on the linguistic context of the words we want to emphasise, we attempt to recover the emphasis pattern in the intonation of originally neutral synthetic speech by generating word-level model parameters with similar context. The method is presented and initial results on synthetic speech are given.