
Speech Synthesis Workshop: Latest Publications

Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-25
Shinji Takaki, Sangjin Kim, J. Yamagishi
In this paper, we investigate the effectiveness of speaker adaptation for various essential components in deep neural network based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters. In general, a speaker adaptation technique, e.g., maximum likelihood linear regression (MLLR) for HMMs or learning hidden unit contributions (LHUC) for DNNs, is applied to an acoustic modeling part to change voice characteristics or speaking styles. However, since we have proposed a multiple DNN-based speech synthesis system, in which several components are represented based on feed-forward DNNs, a speaker adaptation technique can be applied not only to the acoustic modeling part but also to other components represented by DNNs. In experiments using a small amount of adaptation data, we performed adaptation based on LHUC and simple additional fine tuning for DNN-based acoustic models, deep auto-encoder based feature extraction, and DNN-based post-filter models and compared them with HMM-based speech synthesis systems using MLLR.
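To make the LHUC adaptation mentioned above concrete, the sketch below re-scales the hidden units of a feed-forward acoustic model with speaker-dependent parameters that are the only weights updated on the adaptation data. It is a minimal illustration in PyTorch, not the paper's configuration; the layer sizes, the tanh activation, the MSE loss, and the training loop are all assumptions.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Hidden layer whose activations are re-scaled by speaker-dependent LHUC
    parameters: h' = 2*sigmoid(r) * h, with r learned per speaker."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.act = nn.Tanh()
        # One LHUC parameter per hidden unit, initialised so 2*sigmoid(0) = 1,
        # i.e. the speaker-independent network is unchanged before adaptation.
        self.lhuc = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        h = self.act(self.linear(x))
        return 2.0 * torch.sigmoid(self.lhuc) * h

def adapt_lhuc(model, adaptation_batches, epochs=10, lr=1e-3):
    """Freeze the speaker-independent weights and update only the LHUC scalers
    on a small set of (linguistic_feats, acoustic_feats) adaptation batches."""
    for p in model.parameters():
        p.requires_grad = False
    lhuc_params = [m.lhuc for m in model.modules() if isinstance(m, LHUCLayer)]
    for p in lhuc_params:
        p.requires_grad = True
    opt = torch.optim.Adam(lhuc_params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for linguistic_feats, acoustic_feats in adaptation_batches:
            opt.zero_grad()
            loss = loss_fn(model(linguistic_feats), acoustic_feats)
            loss.backward()
            opt.step()
    return model

# Example acoustic model; the 400-dim linguistic input and 187-dim acoustic
# output are assumed values, not the paper's feature dimensions.
model = nn.Sequential(LHUCLayer(400, 512), LHUCLayer(512, 512), nn.Linear(512, 187))
```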
Citations: 8
WikiSpeech - enabling open source text-to-speech for Wikipedia
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-16
J. Andersson, S. Berlin, André Costa, Harald Berthelsen, Hanna Lindgren, N. Lindberg, J. Beskow, Jens Edlund, Joakim Gustafson
We present WikiSpeech, an ambitious joint project aiming to (1) make open source text-to-speech available through Wikimedia Foundation’s server architecture; (2) utilize the large and active Wikipedia user base to achieve continuously improving text-to-speech; (3) improve existing and develop new crowdsourcing methods for text-to-speech; and (4) develop new and adapt current evaluation methods so that they are well suited for the particular use case of reading Wikipedia articles out loud while at the same time capable of harnessing the huge user base made available by Wikipedia. At its inauguration, the project is backed by The Swedish Post and Telecom Authority and headed by Wikimedia Sverige, STTS and KTH, but in the long run, the project aims at broad multinational involvement. The vision of the project is freely available text-to-speech for all Wikipedia languages (currently 293). In this paper, we present the project itself and its first steps: requirements, initial architecture, and initial steps to include crowdsourcing and evaluation.
Citations: 4
Open-Source Consumer-Grade Indic Text To Speech
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-31
Andrew Wilkinson, A. Parlikar, Sunayana Sitaram, Tim White, A. Black, Suresh Bazaj
Open-source text-to-speech (TTS) software has enabled the development of voices in multiple languages, including many high-resource languages, such as English and European languages. However, building voices for low-resource languages is still challenging. We describe the development of TTS systems for 12 Indian languages using the Festvox framework, for which we developed a common frontend for Indian languages. Voices for eight of these 12 languages are available for use with Flite, a lightweight, fast run-time synthesizer, and the Android Flite app available in the Google Play store. Recently, the baseline Punjabi TTS voice was built end-to-end in a month by two undergraduate students (without any prior knowledge of TTS) with help from two of the authors of this paper. The framework can be used to build a baseline Indic TTS voice in two weeks, once a text corpus is selected and a suitable native speaker is identified.
Citations: 10
Prosodic and Spectral iVectors for Expressive Speech Synthesis
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-10
Igor Jauk, A. Bonafonte
This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, in expressive speech analysis and synthesis. For each utterance of two different databases, a laboratory-recorded emotional acted speech corpus and an audiobook, several prosodic and acoustic features are extracted. Among them, i-vectors are built not only on the MFCC base, but also on F0, power and syllable durations. Then, unsupervised clustering is performed using different feature combinations. The resulting clusters are evaluated by calculating cluster entropy for labeled portions of the databases. Additionally, synthetic voices are trained, applying speaker adaptive training, from the clusters built from the audiobook. The voices are evaluated in a perceptual test where the participants have to edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated. On the other hand, for the laboratory recordings, traditional prosodic features outperform i-vectors. Also, a closer analysis of the created clusters suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector based feature combinations can be used for audiobook clustering and voice training.
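The cluster entropy used above to score the unsupervised clusters against the labeled portions of the databases can be computed as the entropy of the label distribution inside each cluster, averaged over clusters. A minimal sketch follows; the size-weighted averaging and the natural-log base are assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids, labels):
    """Size-weighted average entropy of the label distribution within each cluster.
    0 means every cluster contains a single emotion/speaker label; higher is worse."""
    clusters = {}
    for cid, lab in zip(cluster_ids, labels):
        clusters.setdefault(cid, []).append(lab)
    total = len(labels)
    h = 0.0
    for members in clusters.values():
        counts = Counter(members)
        n = len(members)
        h_c = -sum((c / n) * math.log(c / n) for c in counts.values())
        h += (n / total) * h_c
    return h

# Example: one pure cluster and one mixed cluster.
print(cluster_entropy([0, 0, 1, 1], ["angry", "angry", "sad", "happy"]))
```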
Citations: 3
Nonaudible murmur enhancement based on statistical voice conversion and noise suppression with external noise monitoring
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-9
Y. Tajiri, T. Toda
This paper presents a method for making nonaudible murmur (NAM) enhancement based on statistical voice conversion (VC) robust against external noise. NAM, which is an extremely soft whispered voice, is a promising medium for silent speech communication thanks to its faint volume. Although such a soft voice can still be detected with a special body-conductive microphone, its quality significantly degrades compared to that of air-conductive voices. It has been shown that the statistical VC technique is capable of significantly improving quality of NAM by converting it into the air-conductive voices. However, this technique is not helpful under noisy conditions because a detected NAM signal easily suffers from external noise, and acoustic mismatches are caused between such a noisy NAM signal and a previously trained conversion model. To address this issue, in this paper we apply our proposed noise suppression method based on external noise monitoring to the statistical NAM enhancement. Moreover, a known noise superimposition method is further applied in order to alleviate the effects of residual noise components on the conversion accuracy. The experimental results demonstrate that the proposed method yields significant improvements in the conversion accuracy compared to the conventional method.
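The abstract does not spell out the noise suppression formulation, so the sketch below only illustrates the general idea of using an externally monitored noise channel: the magnitude spectrum of the external reference is scaled and subtracted from the noisy NAM spectrum frame by frame, keeping the NAM phase. The STFT framing, the over-subtraction factor, and the spectral floor are assumptions for illustration only, not the paper's method.

```python
import numpy as np

def suppress_with_external_reference(nam, noise_ref, frame=512, hop=128,
                                     alpha=2.0, floor=0.02):
    """Illustrative reference-based spectral subtraction: the magnitude spectrum of
    an external noise-monitoring channel is scaled and subtracted from the noisy
    NAM channel, frame by frame, with overlap-add resynthesis."""
    win = np.hanning(frame)
    out = np.zeros(len(nam))
    norm = np.zeros(len(nam))
    for start in range(0, len(nam) - frame, hop):
        n_frame = nam[start:start + frame] * win
        r_frame = noise_ref[start:start + frame] * win
        N = np.fft.rfft(n_frame)
        R = np.fft.rfft(r_frame)
        mag = np.abs(N) - alpha * np.abs(R)          # subtract estimated noise
        mag = np.maximum(mag, floor * np.abs(N))     # spectral floor
        cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(N)), frame)
        out[start:start + frame] += cleaned * win
        norm[start:start + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Toy usage with synthetic signals (stand-ins for real NAM / reference recordings).
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1, size=16000)
nam = 0.05 * np.sin(2 * np.pi * 120 * np.arange(16000) / 16000) + noise
enhanced = suppress_with_external_reference(nam, noise)
```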
Citations: 0
Jerk Minimization for Acoustic-To-Articulatory Inversion
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-14
Avni Rajpal, H. Patil
Effortless speech production in humans requires coordinated movements of the articulators such as the lips, tongue, jaw, velum, etc. Therefore, the measured trajectories obtained are smooth and slowly varying. However, the trajectories estimated by acoustic-to-articulatory inversion (AAI) are found to be jagged. Thus, energy minimization is used as a smoothness constraint for improving the performance of AAI. Besides energy minimization, jerk (i.e., the rate of change of acceleration) is known as a quantification of smoothness for human motor movements: human motor actions are organized to achieve the intended goal with the smoothest possible movements, under the constraint of minimum accelerative transients. In this paper, we propose jerk minimization as an alternative smoothness criterion for frame-based acoustic-to-articulatory inversion. The resulting trajectories are smooth in the sense that, for an articulator-specific window size, they have minimum jerk. The results using this criterion were found to be comparable with inversion schemes based on existing energy minimization criteria for achieving smoothness.
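As a worked illustration of jerk as a smoothness criterion, the sketch below approximates jerk with third-order finite differences and smooths a jagged trajectory by a least-squares fit that penalizes squared jerk. This penalized-smoothing formulation, the frame shift, and the penalty weight are assumptions chosen for illustration; it is not claimed to be the paper's exact optimization.

```python
import numpy as np

def mean_squared_jerk(traj, dt=0.005):
    """Jerk = third time derivative; approximated here with third-order finite
    differences of a 1-D articulatory trajectory (5 ms frame shift assumed)."""
    jerk = np.diff(traj, n=3) / dt ** 3
    return np.mean(jerk ** 2)

def jerk_penalised_smooth(traj, lam=1.0):
    """Smooth a jagged AAI trajectory y by solving
        min_x ||x - y||^2 + lam * ||D3 x||^2,
    i.e. a least-squares fit that penalises squared third differences (jerk)."""
    n = len(traj)
    D = np.zeros((n - 3, n))
    for i in range(n - 3):
        D[i, i:i + 4] = [-1.0, 3.0, -3.0, 1.0]   # third-difference stencil
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, traj)

# Example: a noisy estimated trajectory becomes smoother and its jerk drops.
t = np.linspace(0, 1, 200)
noisy = np.sin(2 * np.pi * 2 * t) + 0.1 * np.random.randn(200)
smooth = jerk_penalised_smooth(noisy, lam=50.0)
print(mean_squared_jerk(noisy), mean_squared_jerk(smooth))
```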
Citations: 1
Multi-output RNN-LSTM for multiple speaker speech synthesis with α-interpolation model
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-19
Santiago Pascual, A. Bonafonte
Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with a single-speaker model. Moreover, we also tackle the problem of speaker interpolation by adding a new output layer (α-layer) on top of the multi-output branches. An identifying code is injected into the layer together with acoustic features of many speakers. Experiments show that the α-layer can effectively learn to interpolate the acoustic features between speakers.
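A minimal sketch of the shared-trunk/multi-output idea described above: two shared LSTM layers feed one linear output branch per speaker, and a simple weighted combination of the branch outputs stands in for the interpolation performed by the α-layer. The layer sizes, feature dimensions, and the interpolation-by-weighted-sum mechanism are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiSpeakerLSTM(nn.Module):
    """Shared recurrent trunk with one output branch per speaker, plus a simple
    weighted combination over branch outputs (a sketch of the α-interpolation idea)."""
    def __init__(self, in_dim, hidden_dim, out_dim, n_speakers):
        super().__init__()
        self.trunk = nn.LSTM(in_dim, hidden_dim, num_layers=2, batch_first=True)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(n_speakers)]
        )

    def forward(self, linguistic_feats, speaker_id):
        h, _ = self.trunk(linguistic_feats)      # layers shared by all speakers
        return self.heads[speaker_id](h)         # speaker-specific output layer

    def interpolate(self, linguistic_feats, alphas):
        """Weighted combination of the per-speaker branches, e.g. alphas=[0.5, 0.5, 0, 0]
        to synthesise a voice halfway between speakers 0 and 1."""
        h, _ = self.trunk(linguistic_feats)
        outs = torch.stack([head(h) for head in self.heads], dim=0)
        w = torch.tensor(alphas, dtype=outs.dtype).view(-1, 1, 1, 1)
        return (w * outs).sum(dim=0)

# Assumed dimensions: 400 linguistic inputs, 187 acoustic outputs, 4 speakers.
model = MultiSpeakerLSTM(in_dim=400, hidden_dim=256, out_dim=187, n_speakers=4)
x = torch.randn(1, 100, 400)                     # one utterance, 100 frames
y_spk2 = model(x, speaker_id=2)
y_mix = model.interpolate(x, alphas=[0.5, 0.5, 0.0, 0.0])
```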
Citations: 10
Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-21
Meet H. Soni, H. Patil
In this paper, we propose a new quality assessment method for synthesized speech. Unlike previous approaches, which use a Hidden Markov Model (HMM) trained on natural utterances as a reference model to predict the quality of synthesized speech, the proposed approach uses knowledge about synthesized speech while training the model. The previous approach has been successfully applied to the quality assessment of synthesized speech for the German language; however, it gave poor results for English-language databases such as the Blizzard Challenge 2008 and 2009 databases. The problem of quality assessment of synthesized speech is posed as a regression problem: the mapping between statistical properties of spectral features extracted from the speech signal and the corresponding speech quality score (MOS) is found using Support Vector Regression (SVR). All experiments were done on the Blizzard Challenge databases of 2008, 2009, 2010 and 2012. The results show that including knowledge about synthesized speech during training improves the performance of the quality assessment system. Moreover, the accuracy of the quality assessment system depends heavily on the kind of synthesis system used for signal generation. On the Blizzard 2008 and 2009 databases, the proposed approach gives correlations of 0.28 and 0.49, respectively, with about 17% of the data used in training; the previous approach gives correlations of 0.3 and 0.09, respectively, using spectral features. For the Blizzard 2012 database, the proposed approach gives a correlation of 0.8 using 12% of the available data in training.
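The regression setup described above can be sketched as follows: each utterance is reduced to statistics of its spectral features, and an SVR maps that vector to a MOS score, with Pearson correlation as the evaluation measure. The choice of statistics, the RBF kernel, and the stand-in random data are assumptions for illustration; they do not reproduce the paper's features or results.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from scipy.stats import pearsonr

def utterance_features(spectral_frames):
    """Collapse a (frames x coeffs) spectral feature matrix, e.g. MFCCs, into a
    fixed-length vector of per-coefficient statistics (mean, std, range)."""
    mu = spectral_frames.mean(axis=0)
    sd = spectral_frames.std(axis=0)
    rng = spectral_frames.max(axis=0) - spectral_frames.min(axis=0)
    return np.concatenate([mu, sd, rng])

# Hypothetical training data: one feature matrix and one stand-in MOS per utterance.
gen = np.random.default_rng(0)
X = np.array([utterance_features(gen.normal(size=(300, 13))) for _ in range(50)])
y = gen.uniform(1.0, 5.0, size=50)               # stand-in MOS labels

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X[:40], y[:40])
pred = model.predict(X[40:])
print("correlation with MOS:", pearsonr(pred, y[40:])[0])
```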
Citations: 6
Mandarin Prosodic Phrase Prediction based on Syntactic Trees
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-26
Zhengchen Zhang, Fuxiang Wu, Chenyu Yang, M. Dong, Fu-qiu Zhou
Prosodic phrases (PPs) are important for Mandarin Text-To-Speech systems. Most existing PP detection methods need large manually annotated corpora to learn their models. In this paper, we propose a rule-based method to predict PP boundaries employing the syntactic information of a sentence. The method is based on the observation that a prosodic phrase is a meaningful segment of a sentence subject to length restrictions. A syntactic structure allows a sentence to be segmented according to grammar; we add length restrictions to the segmentations to predict the PP boundaries. An F-score of 0.693 was obtained in the experiments, about 0.02 higher than that obtained by a Conditional Random Field based method.
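A minimal sketch of rule-based segmentation under a length restriction: constituents of a toy parse tree are emitted as prosodic phrases when they are short enough, and split recursively into their children otherwise. The length threshold, the character-count proxy for phrase length, and the toy tree are assumptions; the paper's actual rules are more elaborate.

```python
from typing import List, Tuple, Union

# A toy constituency tree: (label, [children]) for internal nodes, (POS, word) for leaves.
Tree = Union[Tuple[str, list], Tuple[str, str]]

MAX_PP_LEN = 7   # assumed upper bound, in characters, per prosodic phrase

def leaves(tree: Tree) -> List[str]:
    label, rest = tree
    if isinstance(rest, str):
        return [rest]
    return [w for child in rest for w in leaves(child)]

def segment(tree: Tree) -> List[List[str]]:
    """Rule-based segmentation: a constituent short enough to respect the length
    restriction becomes one prosodic phrase; otherwise recurse into its children."""
    words = leaves(tree)
    if sum(len(w) for w in words) <= MAX_PP_LEN or isinstance(tree[1], str):
        return [words]
    phrases = []
    for child in tree[1]:
        phrases.extend(segment(child))
    return phrases

# Toy parse (not a real treebank analysis): "我们 提出 一种 基于规则的 方法"
toy = ("IP", [
    ("NP", [("PN", "我们")]),
    ("VP", [("VV", "提出"),
            ("NP", [("QP", "一种"), ("ADJP", "基于规则的"), ("NN", "方法")])]),
])
print(segment(toy))
```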
Citations: 10
Automatic, model-based detection of pause-less phrase boundaries from fundamental frequency and duration features
Pub Date : 2016-09-13 DOI: 10.21437/SSW.2016-1
Mahsa Sadat Elyasi Langarani, J. V. Santen
Prosodic phrase boundaries (PBs) are a key aspect of spoken communication. In automatic PB detection, it is common to use local acoustic features, textual features, or a combination of both. Most approaches, regardless of the features used, succeed in detecting major PBs (break score "4" in ToBI annotation, typically involving a pause), while detection of intermediate PBs (break score "3" in ToBI annotation) is still challenging. In this study we investigate the detection of intermediate, "pause-less" PBs using prosodic models, on a new corpus characterized by strong prosodic dynamics and an existing (CMU) corpus. We show how duration and fundamental frequency modeling can improve detection of these PBs, as measured by the F1 score, compared to Festival, which only uses textual features to detect PBs. We believe that this study contributes to our understanding of the prosody of phrase breaks.
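The kind of F0 and duration modeling referred to above can be illustrated with a toy boundary detector: each candidate word juncture is described by an F0-reset and a pre-boundary-lengthening feature, a classifier scores it as break or no break, and the result is evaluated with F1. The logistic-regression model and the synthetic data are assumptions for illustration only; the paper's prosodic models are different.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def juncture_features(f0_left, f0_right, dur_last_syllable):
    """Features for one candidate boundary between two words: F0 reset across the
    juncture and pre-boundary syllable lengthening (no pause feature, since the
    boundaries of interest here are pause-less)."""
    return [float(np.mean(f0_right) - np.mean(f0_left)), dur_last_syllable]

# Synthetic stand-in data: assume breaks show an F0 reset and a longer final syllable.
rng = np.random.default_rng(1)
rows, labels = [], []
for _ in range(200):
    is_break = rng.random() < 0.5
    f0_left = rng.normal(200.0, 10.0, size=20)
    f0_right = f0_left + (15.0 if is_break else 0.0) + rng.normal(0.0, 5.0, size=20)
    dur = rng.normal(0.25 if is_break else 0.15, 0.03)
    rows.append(juncture_features(f0_left, f0_right, dur))
    labels.append(int(is_break))

X, y = np.array(rows), np.array(labels)
clf = LogisticRegression().fit(X[:150], y[:150])
print("F1 on held-out junctures:", f1_score(y[150:], clf.predict(X[150:])))
```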
Citations: 0