
Speech Synthesis Workshop: Latest Publications

Non-filter waveform generation from cepstrum using spectral phase reconstruction
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-5
Yasuhiro Hamada, Nobutaka Ono, S. Sagayama
This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction as an alternative to the conventional source-filter model in text-to-speech (TTS) systems. Since the primary purpose of the filter is to produce a waveform from a desired spectral shape, one possible alternative to the source-filter framework is to convert the designed spectrum directly into a waveform using a recently developed "phase reconstruction" from the power spectrogram. Given cepstral features and the fundamental frequency (F0) from a TTS system as the desired spectrum, the spectrum heard by the listener is calculated by converting the cepstral features into a linear-scale power spectrum and multiplying it by the pitch structure of F0. The signal waveform is then generated from the power spectrogram by spectral phase reconstruction. An advantageous property of the proposed method is that it is free from the undesired amplitude peaks and long decay times often caused by sharp resonances in recursive filters. In preliminary experiments, we compared the temporal and gain characteristics of speech synthesized with the proposed method and with the mel-log spectrum approximation (MLSA) filter. Results show that the proposed method outperformed the MLSA filter in both characteristics, suggesting desirable properties for speech synthesis.
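To make the pipeline concrete, here is a minimal sketch (not the authors' implementation) of the three steps the abstract describes: inverting cepstra to a linear-scale magnitude spectrum, imposing an F0 pitch structure, and reconstructing phase from the spectrogram. Griffin-Lim stands in for the paper's phase-reconstruction algorithm, and all function names, shapes and parameters are assumptions.

```python
# Hedged sketch: cepstra -> magnitude spectrum -> pitch-structured spectrogram
# -> waveform via Griffin-Lim phase reconstruction (a generic stand-in for the
# paper's spectral phase reconstruction). Shapes and parameters are assumptions.
import numpy as np
import librosa

def cepstra_to_magnitude(cepstra, n_fft=1024):
    # The real cepstrum is the inverse DFT of the log magnitude spectrum,
    # so an rfft takes us back (approximately) to log magnitude.
    log_mag = np.fft.rfft(cepstra, n=n_fft, axis=-1).real
    return np.exp(log_mag)                       # (frames, n_fft // 2 + 1)

def synthesize(cepstra, f0, sr=16000, n_fft=1024, hop=80):
    mag = cepstra_to_magnitude(cepstra, n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    comb = np.ones_like(mag)                     # crude harmonic comb per frame
    for t, f in enumerate(f0):
        if f > 0:                                # voiced: emphasise harmonics of F0
            comb[t] = 0.1 + 0.9 * (0.5 + 0.5 * np.cos(2 * np.pi * freqs / f))
    # librosa expects (bins, frames); Griffin-Lim iterates STFT/ISTFT to find
    # a phase consistent with the given magnitude spectrogram.
    return librosa.griffinlim((mag * comb).T, n_iter=60,
                              hop_length=hop, win_length=n_fft)
```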
Citations: 1
Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-23
Zhaojie Luo, T. Takiguchi, Y. Ariki
Artificial neural networks are among the most important models for learning the features of voice conversion (VC) tasks. Neural networks (NNs) are typically very effective at processing nonlinear features, such as the mel cepstral coefficients (MCCs) that represent the spectrum. However, a simple representation of the fundamental frequency (F0) is not enough for neural networks to deal with an emotional voice, because the F0 time sequence of an emotional voice changes drastically. Therefore, in this paper we propose an effective method that uses the continuous wavelet transform (CWT) to decompose F0 into different temporal scales that NNs can learn well for prosody modeling in emotional voice conversion. In addition, the proposed method uses deep belief networks (DBNs) to pre-train the NNs that convert the spectral features. By combining these approaches, the proposed method can change the spectrum and the prosody of an emotional voice at the same time, and it outperformed other state-of-the-art methods for emotional voice conversion.
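As a rough illustration of the decomposition step, the sketch below uses PyWavelets' continuous wavelet transform to split an interpolated log-F0 contour into several temporal scales. The wavelet, the dyadic scales and the frame shift are my assumptions, not the paper's exact settings.

```python
# Hedged sketch: CWT decomposition of a log-F0 contour into temporal scales,
# each of which could then be modelled separately by a neural network.
import numpy as np
import pywt

def decompose_f0(log_f0, n_scales=10, frame_shift=0.005):
    # Dyadic scales spanning roughly phone-level to utterance-level movements.
    scales = 2.0 ** np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(log_f0, scales, 'mexh', sampling_period=frame_shift)
    return coeffs                               # (n_scales, n_frames)
```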
Citations: 13
On the impact of phoneme alignment in DNN-based speech synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-32
Mei Li, Zhizheng Wu, Lei Xie
Recently, deep neural networks (DNNs) have significantly improved the performance of acoustic modeling in statistical parametric speech synthesis (SPSS). In current implementations, however, training a DNN-based speech synthesis system requires the phonetic transcripts to be aligned with the corresponding speech frames to obtain the phonetic segmentation, called phoneme alignment. Such an alignment is usually obtained by forced alignment based on hidden Markov models (HMMs), since manual alignment is labor-intensive and time-consuming. In this work, we study the impact of phoneme alignment on DNN-based speech synthesis. Specifically, we compare the performance of DNN-based speech synthesis systems built with manual alignment and with HMM-based forced alignment from three types of labels: HMM mono-phone, tri-phone and full-context. Objective and subjective evaluations are conducted in terms of the naturalness of the synthesized speech to compare the performance of the different alignments.
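The schematic sketch below shows where alignment quality enters the training pipeline: phone-level linguistic features are expanded to frame level using the durations that an alignment (manual or HMM-forced) provides, so alignment errors shift which frames each phone's features are paired with. Names and shapes are illustrative only.

```python
# Hedged sketch: upsampling phone-level inputs to frame level with
# alignment-derived durations for DNN acoustic model training.
import numpy as np

def phones_to_frames(phone_feats, durations):
    """phone_feats: (n_phones, dim); durations: (n_phones,) frame counts."""
    return np.repeat(phone_feats, durations, axis=0)   # (n_frames, dim)
```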
Citations: 5
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-24
Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, J. Yamagishi
Deep learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments show that this scheme produces much better results than a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the a-layer can effectively learn to interpolate the acoustic features between speakers.
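A rough PyTorch sketch of the scheme this abstract describes: hidden layers shared across speakers, one output branch per speaker, and weights over the branches that also permit speaker interpolation. Layer sizes, dimensions and the weighting mechanism are assumptions, not the paper's exact architecture.

```python
# Hedged sketch: shared hidden layers with per-speaker output branches;
# speaker_weights can be one-hot (a single speaker) or a mixture
# (interpolation between speakers). All dimensions are placeholders.
import torch
import torch.nn as nn

class MultiSpeakerModel(nn.Module):
    def __init__(self, in_dim=400, hidden=256, out_dim=187, n_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden, out_dim) for _ in range(n_speakers))

    def forward(self, x, speaker_weights):
        h = self.shared(x)                                     # (batch, hidden)
        outs = torch.stack([head(h) for head in self.heads])   # (spk, batch, out)
        return (speaker_weights.view(-1, 1, 1) * outs).sum(dim=0)
```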
Citations: 277
Wide Passband Design for Cosine-Modulated Filter Banks in Sinusoidal Speech Synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-29
Nobuyuki Nishizawa, Tomonori Yazaki
A new filter design strategy that shortens the filter length is introduced for sinusoidal speech synthesis using cosine-modulated filter banks. Multiple sinusoidal waveforms for speech synthesis can be synthesized efficiently with pseudo-quadrature mirror filter (pseudo-QMF) banks, which are constructed by cosine modulation of the coefficients of a low-pass prototype filter. This is because stable sinusoids are represented as sparse vectors in the subband domain of the pseudo-QMF banks, and the filter-bank computation can be performed efficiently with fast algorithms for the discrete cosine transform (DCT). However, pseudo-QMF banks require relatively long filters to reduce the noise caused by aliasing. In this study, a wider-passband design based on a perfect-reconstruction (PR) QMF bank is introduced. The properties of experimentally designed filters indicate that, for 32-subband systems, the filter length can be reduced from 448 taps to 384 taps with errors below -96 dB, without significantly increasing the computational cost of speech synthesis.
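For reference, the sketch below builds a pseudo-QMF analysis bank by cosine-modulating a single prototype low-pass filter. This is the textbook construction; the paper's contribution concerns how the prototype itself is designed (a wider passband under PR constraints), which this generic code does not reproduce.

```python
# Hedged sketch: generic cosine modulation of a prototype low-pass filter
# into an M-band pseudo-QMF bank (not the paper's wide-passband design).
import numpy as np
from scipy.signal import firwin

def cosine_modulated_bank(M=32, taps=384):
    h = firwin(taps, 1.0 / (2 * M))       # prototype low-pass, cutoff pi/(2M)
    n = np.arange(taps)
    k = np.arange(M)[:, None]
    phase = (-1.0) ** k * np.pi / 4       # alias-cancelling phase offsets
    return 2 * h * np.cos((2 * k + 1) * np.pi / (2 * M)
                          * (n - (taps - 1) / 2) + phase)   # (M, taps)
```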
Citations: 0
Utterance Selection Techniques for TTS Systems Using Found Speech
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-30
P. Baljekar, A. Black
The goal of this paper is to investigate data selection techniques for found speech. Found speech, unlike the clean, phonetically balanced datasets recorded specifically for synthesis, contains a lot of noise that may not be labeled well, and it may contain utterances with varying channel conditions. These channel variations and other noise distortions can sometimes be useful, adding diverse data to the training set, but in other cases they can be detrimental to the system. The approach outlined in this work investigates various metrics for detecting noisy data that degrade the performance of the system on a held-out test set. We assume a seed set of 100 utterances, to which we incrementally add fixed sets of utterances, and determine which metrics can capture the misaligned and noisy data. We report results on three datasets: an artificially degraded set of clean speech, a single-speaker database of found speech, and a multi-speaker database of found speech. All of our experiments are carried out on male speakers. We also show that comparable results are obtained on a female multi-speaker corpus.
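Schematically, the procedure reads as the loop below: train from a seed set, add a fixed-size batch of utterances, retrain, and watch the held-out score. The training and scoring callbacks are placeholders, not the paper's actual metrics.

```python
# Hedged sketch of incremental utterance selection: batches whose addition
# degrades the held-out score are candidates for noisy/misaligned data.
def incremental_selection(utterances, train_fn, eval_fn,
                          seed_size=100, batch_size=50):
    selected = list(utterances[:seed_size])
    scores = []
    for i in range(seed_size, len(utterances), batch_size):
        selected.extend(utterances[i:i + batch_size])
        model = train_fn(selected)        # hypothetical training callback
        scores.append(eval_fn(model))     # e.g. objective error on held-out set
    return scores
```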
Citations: 15
A hybrid harmonics-and-bursts modelling approach to speech synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-34
J. Beskow, Harald Berthelsen
Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders work well for voiced speech because they offer independent control over the voice source (e.g. pitch) and the vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour, and here the benefits of the vocoder are less clear. In this paper, we investigate a hybrid approach to modelling the speech signal, in which speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
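One plausible reading of the spectrogram kernel filtering step (my assumption, not necessarily the paper's exact method) is harmonic/percussive separation by median filtering, as implemented in librosa: the harmonic channel would go to the vocoder-plus-statistics model and the burst channel to concatenation.

```python
# Hedged sketch: median-filtering-based harmonic/percussive separation as a
# stand-in for the paper's spectrogram kernel filtering.
import librosa

def split_harmonics_and_bursts(wav, n_fft=1024, hop=256):
    S = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    H, P = librosa.decompose.hpss(S, kernel_size=31)
    harmonic = librosa.istft(H, hop_length=hop)  # -> vocoder + statistical model
    bursts = librosa.istft(P, hop_length=hop)    # -> concatenative modelling
    return harmonic, bursts
```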
Citations: 1
Temporal modeling in neural network based statistical parametric speech synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-18
K. Tokuda, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku
This paper proposes a novel neural network structure for speech synthesis in which spectrum, F0 and duration parameters are modeled simultaneously in a unified framework. In conventional neural network approaches, spectrum and F0 parameters are predicted by neural networks, while phone and/or state durations are supplied by external duration predictors. In order to model not only the spectrum and F0 parameters but also the durations consistently, we adopt a special type of mixture density network (MDN) structure, which models utterance-level probability density functions conditioned on the corresponding input feature sequence. This is achieved by modeling the conditional probability distribution of the utterance-level output features, given the input features, with a hidden semi-Markov model whose parameters are generated by a neural network trained with a log-likelihood-based loss function. Variations of the proposed neural network structure are also discussed. Subjective listening test results show that the proposed approach improves the naturalness of synthesized speech.
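As a minimal illustration of the mixture density network component, the sketch below maps a hidden representation to mixture weights, means and log standard deviations. The hidden semi-Markov duration structure that the paper couples this with is not shown, and all sizes are assumptions.

```python
# Hedged sketch: an MDN output layer producing a Gaussian mixture over
# acoustic features (the paper's HSMM duration coupling is omitted).
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, hidden=256, out_dim=187, n_mix=4):
        super().__init__()
        self.n_mix, self.out_dim = n_mix, out_dim
        self.proj = nn.Linear(hidden, n_mix * (1 + 2 * out_dim))

    def forward(self, h):
        p = self.proj(h)
        w, mu, log_sigma = p.split([self.n_mix,
                                    self.n_mix * self.out_dim,
                                    self.n_mix * self.out_dim], dim=-1)
        shape = h.shape[:-1] + (self.n_mix, self.out_dim)
        return (torch.log_softmax(w, dim=-1),   # log mixture weights
                mu.view(shape),                 # component means
                log_sigma.view(shape))          # component log std devs
```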
Citations: 13
Multidimensional scaling of systems in the Voice Conversion Challenge 2016
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-7
M. Wester, Zhizheng Wu, J. Yamagishi
This study investigates how listeners judge the similarity of voice-converted voices using a talker discrimination task. The data are from the Voice Conversion Challenge 2016, in which 17 participants from around the world built voice-converted voices from a shared dataset of source and target speakers. This paper describes the evaluation of similarity for four of the source-target pairs (two intra-gender and two cross-gender) in more detail. Multidimensional scaling was performed to illustrate where each system was perceived to lie in an acoustic space relative to the source and target speakers and to the other systems.
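The analysis step can be sketched as below: an aggregated dissimilarity matrix derived from the talker discrimination judgments is embedded in two dimensions with metric MDS. The matrix itself is a placeholder for the listeners' aggregated "same/different" responses.

```python
# Hedged sketch: 2-D multidimensional scaling of a listener-derived
# dissimilarity matrix over systems plus the source and target speakers.
import numpy as np
from sklearn.manifold import MDS

def embed_systems(dissimilarity):
    # dissimilarity: symmetric (n, n) matrix, e.g. rate of "different" responses
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    return mds.fit_transform(np.asarray(dissimilarity))   # (n, 2) coordinates
```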
Citations: 8
Emphasis recreation for TTS using intonation atoms
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-3
Pierre-Edouard Honnet, Philip N. Garner
We are interested in emphasis for text-to-speech synthesis. In speech-to-speech translation, emphasising the correct words is important to convey the underlying meaning of a message. In this paper, we propose to use a generalised command-response (CR) model of intonation to generate emphasis in synthetic speech. We first analyse the differences in the model parameters between emphasised words in an acted-emphasis scenario and their neutral counterparts. We investigate word-level intonation modelling using a simple random forest as the basic framework to predict the model parameters in the specific case of emphasised words. Based on the linguistic context of the words we want to emphasise, we attempt to recover the emphasis pattern in the intonation of originally neutral synthetic speech by generating word-level model parameters with similar context. The method is presented and initial results on synthetic speech are given.
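For intuition, here is a toy sketch of a single intonation atom of the kind used in generalised command-response models: a gamma-shaped kernel that, scaled and shifted in time, composes an F0 contour. The kernel order and time constant are illustrative assumptions, not the paper's fitted values.

```python
# Hedged sketch: a gamma-kernel "atom", an elementary F0 movement in a
# generalised command-response intonation model; parameters are assumptions.
import numpy as np

def gamma_atom(t, k=6, theta=0.05):
    tt = np.maximum(t, 0.0)                  # atom is zero before its onset
    atom = tt ** (k - 1) * np.exp(-tt / theta)
    return atom / atom.max()                 # normalise to unit peak

t = np.arange(0.0, 0.6, 0.005)               # 5 ms frames over 0.6 s
contour = 2.0 * gamma_atom(t - 0.1)          # atom of amplitude 2, onset at 0.1 s
```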
Citations: 4