
Speech Synthesis Workshop: Latest Publications

Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-28
Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, S. Gangashetty
In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. In a typical DNN-based system there is a hierarchy of text features from phone level to utterance level, usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing these conventional text features with a continuous, frame-level, acoustically guided representation would improve acoustic modeling. The hidden state of an RNN trained to predict acoustic features is used as this additional contextual information. A dataset consisting of two Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
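To make the idea concrete, the following minimal sketch (PyTorch assumed; all layer sizes and feature dimensions are illustrative, not taken from the paper) shows a context RNN trained to predict acoustic features whose hidden state is concatenated with the conventional text features before a feed-forward acoustic model.

```python
# Minimal sketch (PyTorch assumed): augmenting DNN inputs with the hidden
# state of an RNN that was trained to predict acoustic features.
import torch
import torch.nn as nn

TEXT_DIM, ACOUSTIC_DIM, HIDDEN_DIM = 600, 187, 128  # illustrative sizes

class ContextRNN(nn.Module):
    """RNN trained to map frame-level text features to acoustic features;
    its hidden state is reused as a continuous context vector."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(TEXT_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, ACOUSTIC_DIM)

    def forward(self, text_feats):          # (batch, frames, TEXT_DIM)
        h, _ = self.rnn(text_feats)         # (batch, frames, HIDDEN_DIM)
        return self.out(h), h

class AcousticDNN(nn.Module):
    """Feed-forward acoustic model whose input is the usual text features
    concatenated with the RNN hidden state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + HIDDEN_DIM, 1024), nn.Tanh(),
            nn.Linear(1024, 1024), nn.Tanh(),
            nn.Linear(1024, ACOUSTIC_DIM))

    def forward(self, text_feats, context):
        return self.net(torch.cat([text_feats, context], dim=-1))

text = torch.randn(8, 200, TEXT_DIM)        # dummy batch of 200 frames
_, context = ContextRNN()(text)
acoustic = AcousticDNN()(text, context.detach())
print(acoustic.shape)                        # torch.Size([8, 200, 187])
```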
Citations: 1
WikiSpeech - enabling open source text-to-speech for Wikipedia
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-16
J. Andersson, S. Berlin, André Costa, Harald Berthelsen, Hanna Lindgren, N. Lindberg, J. Beskow, Jens Edlund, Joakim Gustafson
We present WikiSpeech, an ambitious joint project aiming to (1) make open source text-to-speech available through Wikimedia Foundation’s server architecture; (2) utilize the large and active Wikipedia user base to achieve continuously improving text-to-speech; (3) improve existing and develop new crowdsourcing methods for text-to-speech; and (4) develop new and adapt current evaluation methods so that they are well suited to the particular use case of reading Wikipedia articles out loud while at the same time being capable of harnessing the huge user base made available by Wikipedia. At its inauguration, the project is backed by The Swedish Post and Telecom Authority and headed by Wikimedia Sverige, STTS and KTH, but in the long run the project aims at broad multinational involvement. The vision of the project is freely available text-to-speech for all Wikipedia languages (currently 293). In this paper, we present the project itself and its first steps: requirements, initial architecture, and initial steps toward crowdsourcing and evaluation.
Citations: 4
Open-Source Consumer-Grade Indic Text To Speech
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-31
Andrew Wilkinson, A. Parlikar, Sunayana Sitaram, Tim White, A. Black, Suresh Bazaj
Open-source text-to-speech (TTS) software has enabled the development of voices in multiple languages, including many high-resource languages, such as English and European languages. However, building voices for low-resource languages is still challenging. We describe the development of TTS systems for 12 Indian languages using the Festvox framework, for which we developed a common frontend for Indian languages. Voices for eight of these 12 languages are available for use with Flite, a lightweight, fast run-time synthesizer, and the Android Flite app available in the Google Play store. Recently, the baseline Punjabi TTS voice was built end-to-end in a month by two undergraduate students (without any prior knowledge of TTS) with help from two of the authors of this paper. The framework can be used to build a baseline Indic TTS voice in two weeks, once a text corpus is selected and a suitable native speaker is identified.
Citations: 10
Prosodic and Spectral iVectors for Expressive Speech Synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-10
Igor Jauk, A. Bonafonte
This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, for expressive speech analysis and synthesis. For each utterance of two different databases, a laboratory-recorded emotional acted-speech corpus and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only on an MFCC basis but also on F0, power, and syllable durations. Unsupervised clustering is then performed using different feature combinations, and the resulting clusters are evaluated by calculating cluster entropy on the labeled portions of the databases. Additionally, synthetic voices are trained with speaker-adaptive training from the clusters built from the audiobook; these voices are evaluated in a perceptual test in which participants edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. For the laboratory recordings, on the other hand, traditional prosodic features outperform i-vectors. A closer analysis of the clusters also suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector-based feature combinations can be used for audiobook clustering and voice training.
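As an illustration of the clustering and evaluation step, the sketch below (scikit-learn assumed; feature dimensions, cluster counts, and labels are dummy values) clusters per-utterance i-vectors with k-means and scores the result with a label-entropy measure in the spirit of the cluster entropy used here.

```python
# Sketch (scikit-learn assumed): cluster per-utterance i-vectors and score
# the clustering with label entropy on an annotated subset.
import numpy as np
from sklearn.cluster import KMeans

def cluster_entropy(cluster_ids, labels):
    """Size-weighted mean entropy (bits) of the label distribution inside each
    cluster; lower means clusters align better with emotion/speaker labels."""
    entropies, sizes = [], []
    for c in np.unique(cluster_ids):
        _, counts = np.unique(labels[cluster_ids == c], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())
        sizes.append(counts.sum())
    return np.average(entropies, weights=sizes)

# ivecs: one row per utterance, e.g. MFCC-, F0-, power- and duration-based
# i-vectors concatenated; labels: emotion annotation for the labelled portion.
ivecs = np.random.randn(500, 4 * 100)            # dummy 400-dim features
labels = np.random.randint(0, 4, size=500)       # dummy emotion labels
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(ivecs)
print("weighted cluster entropy:", cluster_entropy(cluster_ids, labels))
```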
Citations: 3
Nonaudible murmur enhancement based on statistical voice conversion and noise suppression with external noise monitoring
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-9
Y. Tajiri, T. Toda
This paper presents a method for making nonaudible murmur (NAM) enhancement based on statistical voice conversion (VC) robust against external noise. NAM, an extremely soft whispered voice, is a promising medium for silent speech communication thanks to its faint volume. Although such a soft voice can still be detected with a special body-conductive microphone, its quality degrades significantly compared to that of air-conductive voices. It has been shown that statistical VC can significantly improve the quality of NAM by converting it into an air-conductive voice. However, this technique is not helpful under noisy conditions, because a detected NAM signal easily suffers from external noise, causing acoustic mismatches between the noisy NAM signal and a previously trained conversion model. To address this issue, we apply our proposed noise suppression method based on external noise monitoring to statistical NAM enhancement. Moreover, a known-noise superimposition method is further applied in order to alleviate the effects of residual noise components on the conversion accuracy. Experimental results demonstrate that the proposed method yields significant improvements in conversion accuracy over the conventional method.
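The paper's specific noise suppressor is not reproduced here; as a rough stand-in, the sketch below (scipy assumed; all parameters are illustrative) shows a generic spectral-subtraction scheme in which an external reference microphone monitors the noise whose average power spectrum is subtracted from the noisy NAM spectrum.

```python
# Illustrative sketch only (scipy assumed): spectral subtraction driven by an
# external reference microphone, standing in for the paper's noise suppressor.
import numpy as np
from scipy.signal import stft, istft

def suppress(nam_signal, noise_ref, fs=16000, over_sub=1.5, floor=0.05):
    """Subtract the monitored external-noise power spectrum from the noisy
    NAM spectrum and resynthesize with the noisy phase."""
    _, _, nam_spec = stft(nam_signal, fs=fs, nperseg=512)
    _, _, noise_spec = stft(noise_ref, fs=fs, nperseg=512)
    noise_power = (np.abs(noise_spec) ** 2).mean(axis=1, keepdims=True)
    clean_power = np.abs(nam_spec) ** 2 - over_sub * noise_power
    clean_power = np.maximum(clean_power, floor * np.abs(nam_spec) ** 2)
    clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(nam_spec))
    _, enhanced = istft(clean_spec, fs=fs, nperseg=512)
    return enhanced

fs = 16000
nam = np.random.randn(fs * 2) * 0.1     # dummy body-conducted NAM signal
noise = np.random.randn(fs * 2) * 0.05  # dummy external-noise reference
print(suppress(nam, noise, fs).shape)
```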
Citations: 0
Jerk Minimization for Acoustic-To-Articulatory Inversion
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-14
Avni Rajpal, H. Patil
Effortless speech production in humans requires coordinated movements of the articulators such as the lips, tongue, jaw, and velum. Measured articulatory trajectories are therefore smooth and slowly varying. However, the trajectories estimated by acoustic-to-articulatory inversion (AAI) are found to be jagged, so energy minimization is commonly used as a smoothness constraint to improve AAI performance. Besides energy minimization, jerk (i.e., the rate of change of acceleration) is a known measure of smoothness for human motor movements: motor actions are organized to achieve the intended goal with the smoothest possible movements, under the constraint of minimum accelerative transients. In this paper, we propose jerk minimization as an alternative smoothness criterion for frame-based acoustic-to-articulatory inversion. The resulting trajectories are smooth in the sense that, for an articulator-specific window size, they have minimum jerk. Results using this criterion were found to be comparable with inversion schemes based on existing energy minimization criteria for achieving smoothness.
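The sketch below (numpy assumed; the regularization weight and difference operator are illustrative rather than the paper's exact formulation) shows the basic idea of a minimum-jerk criterion: penalize the third-order difference of an estimated trajectory and solve the fidelity/smoothness trade-off in closed form.

```python
# Sketch (numpy assumed): smooth a jagged articulator trajectory by trading
# fidelity against jerk (third-order difference), solved in closed form:
#   minimize ||x - y||^2 + lam * ||D3 x||^2   =>   (I + lam * D3'D3) x = y
import numpy as np

def min_jerk_smooth(y, lam=10.0):
    n = len(y)
    D3 = np.zeros((n - 3, n))
    for i in range(n - 3):                     # third-order difference operator
        D3[i, i:i + 4] = [-1.0, 3.0, -3.0, 1.0]
    A = np.eye(n) + lam * D3.T @ D3
    return np.linalg.solve(A, y)

def mean_squared_jerk(x):
    return np.mean(np.diff(x, n=3) ** 2)

# Dummy "jagged" trajectory from an AAI system: smooth target plus noise.
t = np.linspace(0, 1, 200)
jagged = np.sin(2 * np.pi * 2 * t) + 0.1 * np.random.randn(200)
smooth = min_jerk_smooth(jagged, lam=50.0)
print(mean_squared_jerk(jagged), ">", mean_squared_jerk(smooth))
```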
Citations: 1
Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-19
Santiago Pascual, A. Bonafonte
Deep learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments show that this scheme produces much better results than a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches. An identifying code is injected into this layer together with acoustic features of many speakers. Experiments show that the a-layer can effectively learn to interpolate the acoustic features between speakers.
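A minimal sketch of the shared-layer/multi-output idea is given below (PyTorch assumed; layer sizes, speaker count, and the interpolation scheme are illustrative and simplified relative to the paper's α-interpolation model): the LSTM layers are shared, each speaker has its own output branch, and two branches can be blended with a weight α.

```python
# Sketch (PyTorch assumed): shared LSTM layers with one output branch per
# speaker, plus alpha-weighted interpolation between two speakers' branches.
import torch
import torch.nn as nn

IN_DIM, HID_DIM, OUT_DIM, N_SPK = 600, 256, 187, 4  # illustrative sizes

class MultiSpeakerLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.LSTM(IN_DIM, HID_DIM, num_layers=2, batch_first=True)
        self.heads = nn.ModuleList(
            [nn.Linear(HID_DIM, OUT_DIM) for _ in range(N_SPK)])

    def forward(self, x, spk):                   # x: (batch, frames, IN_DIM)
        h, _ = self.shared(x)
        return self.heads[spk](h)

    def interpolate(self, x, spk_a, spk_b, alpha):
        """Blend two speakers: alpha=0 -> speaker a, alpha=1 -> speaker b."""
        h, _ = self.shared(x)
        return (1 - alpha) * self.heads[spk_a](h) + alpha * self.heads[spk_b](h)

model = MultiSpeakerLSTM()
x = torch.randn(2, 100, IN_DIM)
print(model(x, spk=0).shape)                     # torch.Size([2, 100, 187])
print(model.interpolate(x, 0, 1, alpha=0.5).shape)
```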
Citations: 10
Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-21
Meet H. Soni, H. Patil
In this paper, we propose a new quality assessment method for synthesized speech. Unlike previous approaches, which use a Hidden Markov Model (HMM) trained on natural utterances as a reference model to predict the quality of synthesized speech, the proposed approach uses knowledge about synthesized speech while training the model. The previous approach has been applied successfully to the quality assessment of synthesized speech for German, but it gave poor results for English-language databases such as the Blizzard Challenge 2008 and 2009 databases. The problem of quality assessment of synthesized speech is posed as a regression problem: the mapping between statistical properties of spectral features extracted from the speech signal and the corresponding speech quality score (MOS) is found using Support Vector Regression (SVR). All experiments were done on the Blizzard Challenge databases of 2008, 2009, 2010 and 2012. The results show that including knowledge about synthesized speech during training improves the performance of the quality assessment system. Moreover, the accuracy of the quality assessment system depends heavily on the kind of synthesis system used for signal generation. On the Blizzard 2008 and 2009 databases, the proposed approach gives correlations of 0.28 and 0.49, respectively, with about 17% of the data used in training, whereas the previous approach gives correlations of 0.3 and 0.09, respectively, using spectral features. For the Blizzard 2012 database, the proposed approach gives a correlation of 0.8 using 12% of the available data in training.
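The sketch below (scikit-learn and scipy assumed; features, MOS targets, split ratio, and hyperparameters are dummy values) illustrates the regression setup: per-utterance statistics of spectral features mapped to MOS with SVR, evaluated by Pearson correlation.

```python
# Sketch (scikit-learn/scipy assumed): map per-utterance statistics of
# spectral features to MOS with support vector regression.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def utterance_stats(frames):
    """Summarize a (frames x dims) spectral-feature matrix into one vector."""
    return np.concatenate([frames.mean(0), frames.std(0),
                           frames.min(0), frames.max(0)])

# Dummy data: 200 synthesized utterances, 13-dim spectral features per frame.
rng = np.random.default_rng(0)
X = np.stack([utterance_stats(rng.normal(size=(300, 13))) for _ in range(200)])
mos = rng.uniform(1.0, 5.0, size=200)            # dummy listener scores

X_tr, X_te, y_tr, y_te = train_test_split(X, mos, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X_tr, y_tr)
corr, _ = pearsonr(model.predict(X_te), y_te)
print("correlation with MOS:", round(corr, 3))
```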
Citations: 6
Mandarin Prosodic Phrase Prediction based on Syntactic Trees
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-26
Zhengchen Zhang, Fuxiang Wu, Chenyu Yang, M. Dong, Fu-qiu Zhou
Prosodic phrases (PPs) are important for Mandarin text-to-speech systems. Most existing PP detection methods need large manually annotated corpora to learn the models. In this paper, we propose a rule-based method to predict PP boundaries employing the syntactic information of a sentence. The method is based on the observation that a prosodic phrase is a meaningful segment of a sentence with length restrictions. A syntactic structure allows a sentence to be segmented according to grammar; we add length restrictions to these segmentations to predict the PP boundaries. An F-score of 0.693 was obtained in the experiments, about 0.02 higher than that obtained by a Conditional Random Field based method.
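A toy sketch of this kind of rule (plain Python; the tree encoding and the length limits are illustrative, not the paper's actual rules) is shown below: descend a constituency tree, keep constituents that fit a length window as prosodic phrases, and merge fragments that fall below a minimum length.

```python
# Toy sketch: emit prosodic-phrase boundaries by cutting a constituency tree
# into segments whose lengths fall inside a target window (limits illustrative).
MIN_LEN, MAX_LEN = 3, 7   # characters per prosodic phrase (assumed limits)

def collect(node):
    """Concatenate the characters under a leaf string or (label, children) node."""
    return node if isinstance(node, str) else "".join(collect(c) for c in node[1])

def segment(node, phrases):
    """Keep a constituent as one phrase if it is a leaf or short enough,
    otherwise recurse into its children."""
    text = node if isinstance(node, str) else collect(node)
    if isinstance(node, str) or len(text) <= MAX_LEN:
        phrases.append(text)
        return
    for child in node[1]:
        segment(child, phrases)

def merge_short(phrases):
    """Greedily merge phrases shorter than MIN_LEN with the previous phrase."""
    merged = []
    for p in phrases:
        can_merge = bool(merged) and len(merged[-1]) + len(p) <= MAX_LEN
        if can_merge and (len(merged[-1]) < MIN_LEN or len(p) < MIN_LEN):
            merged[-1] += p
        else:
            merged.append(p)
    return merged

# Toy parse of a Mandarin sentence (labels and bracketing are illustrative).
tree = ("IP", [("NP", ["我们"]), ("VP", [("VV", ["提出了"]),
        ("NP", [("NP", ["一种"]), ("NP", ["韵律短语预测方法"])])])])
phrases = []
segment(tree, phrases)
print(" | ".join(merge_short(phrases)))
```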
Citations: 10
Non-filter waveform generation from cepstrum using spectral phase reconstruction
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-5
Yasuhiro Hamada, Nobutaka Ono, S. Sagayama
This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction, as an alternative to the conventional source-filter model in text-to-speech (TTS) systems. Since the primary purpose of the filter is to produce a waveform with the desired spectral shape, one possible alternative to the source-filter framework is to convert the designed spectrum directly into a waveform by utilizing a recently developed “phase reconstruction” from the power spectrogram. Given cepstral features and the fundamental frequency (F0) as the desired spectrum from a TTS system, the spectrum to be heard by the listener is calculated by converting the cepstral features into a linear-scale power spectrum and multiplying it with the pitch structure of F0. The signal waveform is then generated from the power spectrogram by spectral phase reconstruction. An advantageous property of the proposed method is that it is free from the undesired amplitudes and long decay times often caused by sharp resonances in recursive filters. In preliminary experiments, we compared the temporal and gain characteristics of speech synthesized with the proposed method and with the mel-log spectrum approximation (MLSA) filter. Results show that the proposed method performed better than the MLSA filter on both characteristics, implying desirable properties of the proposed method for speech synthesis.
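As a rough illustration of the pipeline (numpy and librosa assumed; the pitch-harmonic structure is omitted and Griffin-Lim stands in for the paper's specific phase reconstruction), the sketch below converts per-frame real-cepstrum coefficients to a magnitude spectrogram and reconstructs a waveform from it.

```python
# Sketch (numpy + librosa assumed): recover a log-magnitude spectrum from
# real-cepstrum coefficients and rebuild a waveform with Griffin-Lim phase
# reconstruction (the F0 pitch structure is omitted for brevity).
import numpy as np
import librosa

def cepstra_to_magnitude(cepstra, n_fft=1024):
    """cepstra: (frames, n_coef) real-cepstrum coefficients per frame."""
    frames, n_coef = cepstra.shape
    full = np.zeros((frames, n_fft))
    full[:, :n_coef] = cepstra
    full[:, n_fft - n_coef + 1:] = cepstra[:, 1:][:, ::-1]   # symmetric part
    log_mag = np.real(np.fft.fft(full, axis=1))[:, :n_fft // 2 + 1]
    return np.exp(log_mag)

# Dummy cepstral trajectory from an acoustic model: (200 frames, 25 coeffs).
cepstra = 0.1 * np.random.randn(200, 25)
cepstra[:, 0] = -2.0                                         # overall gain
mag = cepstra_to_magnitude(cepstra).T                        # (freq, frames)
wav = librosa.griffinlim(mag, n_iter=60, hop_length=256, win_length=1024)
print(wav.shape)
```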
Citations: 1