
Speech Synthesis Workshop: Latest Publications

Archiving pushed Inferences from Sensor Data Streams
Pub Date: 2018-04-09, DOI: 10.5220/0003116000380046
J. Brunsmann
Although pervasively deployed, sensors are currently neither highly interconnected nor very intelligent: they do not know about each other and produce only raw data streams. This lack of interoperability and of high-level reasoning capability is a major obstacle to exploiting the full potential of sensor data streams. Since interoperability and reasoning processes require a common understanding, RDF-based linked sensor data is used in the semantic sensor web to articulate the meaning of sensor data. This paper shows how to derive higher levels of understanding of streamed sensor data by constructing reasoning knowledge with SPARQL. In addition, it demonstrates how to push these inferences to interested clients in different application domains such as social media streaming, weather observation and intelligent product lifecycle maintenance. Finally, the paper describes how real-time pushing of inferences enables provenance tracking and how archiving of inferred events could support further decision-making processes.
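To make the reasoning step concrete, here is a minimal Python sketch, using rdflib, of deriving a higher-level event from raw sensor readings with a SPARQL CONSTRUCT rule. The namespace, property names and the temperature threshold are illustrative assumptions, not the paper's vocabulary.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/sensor#")

g = Graph()
g.bind("ex", EX)
# Two raw observations from a temperature sensor stream.
g.add((EX.obs1, EX.sensor, EX.tempSensor7))
g.add((EX.obs1, EX.value, Literal(36.2)))
g.add((EX.obs2, EX.sensor, EX.tempSensor7))
g.add((EX.obs2, EX.value, Literal(21.4)))

# Reasoning knowledge as a CONSTRUCT rule: readings above a threshold are
# lifted to a domain-level "overheating" event.
rule = """
PREFIX ex: <http://example.org/sensor#>
CONSTRUCT { ?obs a ex:OverheatingEvent ; ex:source ?sensor . }
WHERE     { ?obs ex:sensor ?sensor ; ex:value ?v . FILTER (?v > 35.0) }
"""

for triple in g.query(rule):
    # In the paper's setting, inferred triples like these would be pushed
    # to subscribed clients and archived; here we simply print them.
    print(triple)
```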
Citations: 5
Parallel and cascaded deep neural networks for text-to-speech synthesis
Pub Date: 2016-09-15, DOI: 10.21437/SSW.2016-17
M. Ribeiro, O. Watts, J. Yamagishi
An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. These experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
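A minimal sketch of the parallel variant, assuming Keras and invented feature and layer sizes (the paper specifies neither): segmental and suprasegmental inputs are processed in separate branches and concatenated before the acoustic output layer. In the cascaded variant, the learned suprasegmental representation would instead be appended to the input of the frame-level network.

```python
import tensorflow as tf
from tensorflow.keras import layers

seg_in = tf.keras.Input(shape=(300,), name="segmental")         # phone-level and below
supra_in = tf.keras.Input(shape=(120,), name="suprasegmental")  # syllable-level and above

# The suprasegmental branch learns a compact distributed representation
# of high-level units, free of any segmental influence.
supra = layers.Dense(64, activation="tanh")(supra_in)
supra = layers.Dense(32, activation="tanh")(supra)

seg = layers.Dense(256, activation="tanh")(seg_in)

merged = layers.concatenate([seg, supra])
hidden = layers.Dense(256, activation="tanh")(merged)
out = layers.Dense(187, name="acoustic_features")(hidden)  # e.g. spectrum + F0 + aperiodicity

model = tf.keras.Model([seg_in, supra_in], out)
model.compile(optimizer="adam", loss="mse")
```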
Citations: 7
Merlin: An Open Source Neural Network Speech Synthesis System
Pub Date: 2016-09-15, DOI: 10.21437/SSW.2016-33
Zhizheng Wu, O. Watts, Simon King
We introduce the Merlin speech synthesis toolkit for neural network-based speech synthesis. The system takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform. Various neural network architectures are implemented, including a standard feedforward neural network, mixture density neural network, recurrent neural network (RNN), long short-term memory (LSTM) recurrent neural network, amongst others. The toolkit is Open Source, written in Python, and is extensible. This paper briefly describes the system, and provides some benchmarking results on a freely-available corpus.
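The overall data flow the abstract describes can be sketched as follows; the function and class names are illustrative stand-ins, not Merlin's actual API.

```python
import numpy as np

def synthesize(acoustic_model, vocoder, text_frontend, text):
    """Linguistic features -> neural network -> acoustic features -> waveform."""
    ling = text_frontend(text)               # (n_frames, n_linguistic_features)
    acoustic = acoustic_model.predict(ling)  # e.g. spectrum, aperiodicity, log F0
    return vocoder(acoustic)                 # waveform samples

# Toy stand-ins, just to show the shapes moving through the pipeline:
frontend = lambda text: np.random.rand(200, 300)

class DummyAcousticModel:
    def predict(self, x):                    # any of the architectures listed above
        return np.random.rand(len(x), 187)

waveform = synthesize(DummyAcousticModel(), lambda a: np.zeros(16000), frontend, "Hello")
```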
Citations: 320
A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora
Pub Date: 2016-09-15, DOI: 10.21437/SSW.2016-20
Xin Wang, Shinji Takaki, J. Yamagishi
This study investigates the impact of the amount of training data on the performance of parametric speech synthesis systems. A Japanese corpus with 100 hours of audio recordings of a male voice and another corpus with 50 hours of recordings of a female voice were utilized to train systems based on hidden Markov models (HMM), feed-forward neural networks and recurrent neural networks (RNN). The results show that the improvement in the accuracy of the predicted spectral features gradually diminishes as the amount of training data increases. However, different from the “diminishing returns” in the spectral stream, the accuracy of the F0 trajectory predicted by the HMM and RNN systems tends to consistently benefit from the increasing amount of training data.
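Spectral prediction accuracy in studies of this kind is commonly scored with mel-cepstral distortion (MCD); a short sketch follows, with the caveat that matching the paper's exact metric is an assumption.

```python
import numpy as np

def mcd(c_ref: np.ndarray, c_pred: np.ndarray) -> float:
    """Mean mel-cepstral distortion in dB between two (frames x order)
    mel-cepstral sequences, excluding the 0th (energy) coefficient."""
    diff = c_ref[:, 1:] - c_pred[:, 1:]
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

# A learning-curve experiment then reduces to a loop over subset sizes:
# for hours in (10, 25, 50, 100):
#     print(hours, mcd(reference_mcep, model_trained_on(hours).predict(...)))
```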
Citations: 14
Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-37
Slava Shechtman, A. Sorin
Speech sinusoidal modeling has been successfully applied to a broad range of speech analysis, synthesis and modification tasks. However, developing a high-fidelity full-band sinusoidal model that preserves its high quality under speech transformation remains an open research problem. Such a system can be extremely useful for high quality speech synthesis. In this paper we present an enhanced harmonic model representation for voiced/mixed wide-band speech that is capable of high quality speech reconstruction and transformation in the parametric domain. Two key elements of the proposed model are a proper phase alignment and a decomposition of a speech frame into "deterministic" and dense "stochastic" harmonic model representations that can be manipulated separately. The stochastic harmonic representation is coupled to the deterministic one by means of an intra-frame periodic energy envelope, estimated at analysis time and preserved during original/transformed speech reconstruction. In addition, we present a compact representation of the stochastic harmonic component, so that the proposed model has fewer parameters than the regular full-band harmonic model, with better signal-to-reconstruction-error performance. On top of that, the improved phase alignment of the proposed model provides better phase coherency in transformed speech, resulting in better-quality speech transformations. We demonstrate the subjective and objective performance of the new model on speech reconstruction and pitch modification tasks. Performance of the proposed model within unit-selection TTS is also presented.
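A toy numpy rendering of the decomposition described above: a "deterministic" harmonic sum plus a "stochastic" component whose energy is shaped by an intra-frame periodic envelope. The amplitudes, phases and envelope here are invented placeholders rather than the paper's estimates.

```python
import numpy as np

fs, f0, frame_len = 16000, 120.0, 640            # 40 ms frame
t = np.arange(frame_len) / fs
K = int((fs / 2) // f0)                          # harmonics up to Nyquist

rng = np.random.default_rng(0)
amps = 1.0 / (1 + np.arange(1, K + 1))           # placeholder spectral tilt
phases = rng.uniform(-np.pi, np.pi, K)           # would come from aligned analysis

deterministic = sum(a * np.cos(2 * np.pi * k * f0 * t + p)
                    for k, (a, p) in enumerate(zip(amps, phases), start=1))

# Stochastic part: noise modulated pitch-synchronously, so it stays coupled
# to the deterministic component when the frame is transformed.
envelope = 0.5 * (1 + np.cos(2 * np.pi * f0 * t))
stochastic = 0.05 * envelope * rng.standard_normal(frame_len)

frame = deterministic + stochastic
```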
Citations: 1
Synthesising Filled Pauses: Representation and Datamixing
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-2
R. Dall, M. Tomalin, M. Wester
Filled pauses occur frequently in spontaneous human speech, yet modern text-to-speech synthesis systems rarely model these disfluencies overtly, and consequently they do not output convincing synthetic filled pauses. This paper presents a text-to-speech system that is specifically designed to model these particular disfluencies more effectively. A preparatory investigation shows that a synthetic voice trained exclusively on spontaneous speech is perceived to be inferior in quality to a voice trained entirely on read speech, even though the latter does not handle filled pauses well. This motivates an investigation into the phonetic representation of filled pauses, which shows that, in a preference test, the use of a distinct phone for filled pauses is preferred over the standard /V/ phone and the alternative /@/ phone. In addition, we present a variety of data-mixing techniques to combine the strengths of standard synthesis systems trained on read speech corpora with the supplementary advantages offered by systems trained on spontaneous speech. In a MUSHRA-style test, it is found that the best overall quality is obtained by combining the two types of corpora using a source marking technique. Specifically, general speech is synthesised with a standard mark, while filled pauses are synthesised with a spontaneous mark, which has the added benefit of also producing filled pauses that are comparatively well synthesised.
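A minimal sketch of the two ingredients, with invented symbols: filled pauses get a phone of their own (here "fp", standing in for the distinct phone the paper prefers over /V/ and /@/), and each training utterance carries a source mark so read and spontaneous data can be mixed in one voice yet selected separately at synthesis time.

```python
LEXICON = {
    "uh":  ["fp"],            # distinct filled-pause phone
    "um":  ["fp", "m"],
    "the": ["dh", "ax"],
}

def label_utterance(words, source):
    """Attach a corpus-source mark ('read' or 'spontaneous') to each phone."""
    assert source in ("read", "spontaneous")
    return [(phone, source) for word in words for phone in LEXICON[word]]

# General speech is synthesised with the read mark, filled pauses with the
# spontaneous mark:
print(label_utterance(["the"], "read") + label_utterance(["um"], "spontaneous"))
```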
Citations: 13
How to select a good voice for TTS
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-15
Sunhee Kim
Even though the perceived quality of a speaker's natural voice does not necessarily guarantee the quality of synthesized speech, a certain number of candidates must be selected on the basis of their natural voice before moving to the evaluation stage of synthesized sentences. This paper describes a male speaker selection procedure for unit selection synthesis systems in English and Japanese, based on perceptive evaluation and acoustic measurements of the speakers' natural voice. A perceptive evaluation is performed on eight professional voice talents for each language. A total of twenty native-speaker listeners are recruited in both languages, and each listener is asked to rate eight analytical factors on a five-point scale and to rank the three best speakers. Acoustic measurement focuses on voice quality by extracting two measures from the Long Term Average Spectrum (LTAS): the so-called Speaker's Formant (SPF), which corresponds to the peak intensity between 3 kHz and 4 kHz, and the Alpha Ratio (AR), the level difference between the 0–1 kHz and 1–4 kHz ranges. The perceptive evaluation results show a very strong correlation between the total score and the preference in both languages, 0.9183 in English and 0.8589 in Japanese. The correlations between the perceptive evaluation and the acoustic measurements are moderate with respect to SPF and AR: 0.473 and -0.494 in English, and 0.288 and -0.263 in Japanese.
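The two LTAS-based measures can be computed directly from a recording; a sketch, assuming a mono float waveform x at sample rate fs, with band edges taken from the abstract and the Welch settings and dB floor chosen as assumptions.

```python
import numpy as np
from scipy.signal import welch

def ltas_measures(x, fs):
    f, pxx = welch(x, fs=fs, nperseg=4096)      # long-term average spectrum
    db = 10 * np.log10(pxx + 1e-12)

    # Speaker's Formant: peak level between 3 kHz and 4 kHz.
    spf = db[(f >= 3000) & (f <= 4000)].max()

    # Alpha Ratio: mean level in 1-4 kHz minus mean level in 0-1 kHz
    # (more negative = steeper spectral slope).
    ar = db[(f > 1000) & (f <= 4000)].mean() - db[(f >= 0) & (f <= 1000)].mean()
    return spf, ar
```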
Citations: 0
Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-4
Eva Vanmassenhove, João P. Cabral, F. Haider
The generation of expressive speech is a great challenge for text-to-speech synthesis in audiobooks. One of the most important factors is the variation in speech emotion or voice style. In this work, we developed a method to predict the emotion of a sentence so that we can convey it through the synthetic voice. It combines a standard emotion-lexicon based technique with the polarity scores (positive/negative polarity) provided by a less fine-grained sentiment analysis tool, in order to obtain more accurate emotion labels. The primary goal of this emotion prediction tool was to select the type of voice (one of the emotions or neutral) for the input sentence given to a state-of-the-art HMM-based Text-to-Speech (TTS) system. In addition, we combined the emotion prediction from text with a speech clustering method to select the utterances with emotion during the process of building the emotional corpus for the speech synthesizer. Speech clustering is a popular approach to dividing speech data into subsets associated with different voice styles. The challenge here is to determine the clusters that map out the basic emotions from an audiobook corpus containing a wide variety of speaking styles, in a way that minimizes the need for human annotation. The evaluation of emotion classification from text showed that, in general, our system can obtain accuracy results close to those of human annotators. Results also indicate that this technique is useful in the selection of utterances with emotion for building expressive synthetic voices.
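A minimal sketch of the combination step: lexicon hits propose candidate emotions, and the coarse polarity score filters out candidates whose valence disagrees. The tiny lexicon and the tie-breaking rule are invented for illustration.

```python
EMOTION_LEXICON = {
    "happy": "joy", "wonderful": "joy",
    "afraid": "fear", "terrible": "sadness", "angry": "anger",
}
POSITIVE_EMOTIONS = {"joy"}

def predict_emotion(sentence: str, polarity: float) -> str:
    """polarity in [-1, 1] would come from the sentiment analysis tool."""
    hits = [EMOTION_LEXICON[w] for w in sentence.lower().split()
            if w in EMOTION_LEXICON]
    if not hits:
        return "neutral"
    # Keep only emotions whose valence agrees with the polarity score;
    # fall back to all hits if none agree.
    agreeing = [e for e in hits
                if (e in POSITIVE_EMOTIONS) == (polarity >= 0)] or hits
    return max(set(agreeing), key=agreeing.count)

print(predict_emotion("what a wonderful day", polarity=0.8))   # -> joy
```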
Citations: 10
Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-13
Sunayana Sitaram, Sai Krishna Rallabandi, Shruti Rijhwani, A. Black
Most Text to Speech (TTS) systems today assume that the input is in a single language written in its native script, which is the language that the TTS database is recorded in. However, due to the rise in conversational data available from social media, phenomena such as code-mixing, in which multiple languages are used together in the same conversation or sentence, are now seen in text. TTS systems capable of synthesizing such text need to be able to handle multiple languages at the same time, and may also need to deal with noisy input. Previously, we proposed a framework to synthesize code-mixed text by using a TTS database in a single language, identifying the language that each word was from, normalizing spellings of a language written in a non-standardized script and mapping the phonetic space of the mixed language to the language that the TTS database was recorded in. We extend this cross-lingual approach to more language pairs, and improve upon our language identification technique. We conduct listening tests to determine which of the two languages being mixed should be used as the target language. We perform experiments for code-mixed Hindi-English and German-English and conduct listening tests with bilingual speakers of these languages. From our subjective experiments we find that listeners have a strong preference for cross-lingual systems with Hindi as the target language for code-mixed Hindi and English text. We also find that listeners prefer cross-lingual systems in English that can synthesize German text for code-mixed German and English text.
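The word-level language identification step can be sketched with Unicode script ranges; note that real code-mixed text also contains romanized Hindi, which this toy check ignores.

```python
def word_language(word: str) -> str:
    # Devanagari block is U+0900..U+097F; everything else falls back to English.
    if any("\u0900" <= ch <= "\u097f" for ch in word):
        return "hi"
    return "en"

sentence = "मुझे यह song बहुत पसंद है"
print([(w, word_language(w)) for w in sentence.split()])
# [('मुझे', 'hi'), ('यह', 'hi'), ('song', 'en'), ('बहुत', 'hi'), ('पसंद', 'hi'), ('है', 'hi')]
```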
Citations: 32
Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
Pub Date: 2016-09-13, DOI: 10.21437/SSW.2016-28
Sivanand Achanta, Rambabu Banoth, Ayushi Pandey, Anandaswarup Vadapalli, S. Gangashetty
In this paper, we propose to use the hidden state vector obtained from a recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. While a typical DNN-based system uses a hierarchy of text features from the phone level to the utterance level, these are usually in a 1-hot-k encoded representation. Our hypothesis is that supplementing the conventional text features with a continuous, frame-level, acoustically guided representation would improve the acoustic modeling. The hidden state from an RNN trained to predict acoustic features is used as the additional contextual information. A dataset consisting of 2 Indian languages (Telugu and Hindi) from the Blizzard Challenge 2015 was used in our experiments. Both the subjective listening tests and the objective scores indicate that the proposed approach performs significantly better than the baseline DNN system.
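A sketch of the idea, assuming Keras and invented dimensions: an RNN is first trained to predict acoustic features from text features, and its frame-level hidden states are then reused as an extra, acoustically guided context vector for the DNN.

```python
import tensorflow as tf
from tensorflow.keras import layers

T, text_dim, acoustic_dim, hidden_dim = 100, 300, 187, 64

txt = tf.keras.Input(shape=(T, text_dim))
states = layers.SimpleRNN(hidden_dim, return_sequences=True)(txt)
acoustic = layers.Dense(acoustic_dim)(states)
rnn = tf.keras.Model(txt, acoustic)
rnn.compile(optimizer="adam", loss="mse")   # trained to predict acoustic features

# After training, expose the hidden state sequence as context vectors...
encoder = tf.keras.Model(txt, states)

# ...and concatenate them with the conventional 1-hot-k text features as
# input to the frame-level DNN acoustic model.
dnn_in = tf.keras.Input(shape=(text_dim + hidden_dim,))
h = layers.Dense(512, activation="tanh")(dnn_in)
h = layers.Dense(512, activation="tanh")(h)
dnn = tf.keras.Model(dnn_in, layers.Dense(acoustic_dim)(h))
```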
Citations: 1