Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423512
Jian Zhang, Risheng Xia, Zhonghua Fu, Junfeng Li, Yonghong Yan
As an indispensable instrument in daily life, the mobile phone is used in diverse environments and suffers from speech quality degradation caused by background noise. In this paper, we propose a novel two-microphone noise reduction system based on the power level ratio (PLR) of the observed signals. In the system, a primary microphone is placed close to the talker's mouth and an auxiliary microphone is placed farther away. The proposed noise reduction algorithm first calculates the ratio of the powers of the signals observed at the two microphones, and then derives a spectral gain function from this power level ratio using a sigmoid function. Experimental results demonstrate that the proposed algorithm yields much higher speech quality than state-of-the-art noise reduction algorithms and, more importantly, incurs much lower computational cost, which makes it feasible for mobile phones.
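The per-bin PLR-to-gain mapping described in this abstract can be sketched as follows. This is a minimal illustration: the sigmoid slope (`alpha`) and midpoint (`plr0`) are hypothetical tuning constants, not values taken from the paper.

```python
import math

def plr_gain(primary_power, auxiliary_power, alpha=4.0, plr0=1.0, eps=1e-12):
    """Spectral gain from the power level ratio (PLR) of two microphones.

    A high primary-to-auxiliary power ratio suggests near-field speech
    (gain -> 1); a ratio near 1 suggests diffuse background noise
    (gain -> 0).  alpha and plr0 are illustrative constants only.
    """
    plr = primary_power / (auxiliary_power + eps)
    # Sigmoid mapping of the log-PLR to a gain in (0, 1).
    return 1.0 / (1.0 + math.exp(-alpha * (math.log(plr + eps) - math.log(plr0))))

def enhance_frame(primary_spec_power, auxiliary_spec_power):
    """Per-frequency-bin gains for one analysis frame."""
    return [plr_gain(p, a) for p, a in zip(primary_spec_power, auxiliary_spec_power)]

# Speech-dominated bin (primary much stronger) vs. noise-dominated bin.
gains = enhance_frame([10.0, 1.0], [1.0, 1.0])
```

The gains would then multiply the primary microphone's spectrum before resynthesis; the per-bin computation involves only a ratio and a sigmoid, which is consistent with the low computational cost the abstract claims.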
Title: A fast two-microphone noise reduction algorithm based on power level ratio for mobile phone
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423493
Chun Xing Li, Zhiyong Wu, Fanbo Meng, H. Meng, Lianhong Cai
This paper addresses the automatic detection of contrastive word pairs and their acoustic realization through emphasis for expressive text-to-speech (TTS) synthesis in English. Support vector machines (SVMs) are used to automatically detect contrastive word pairs from lexical features, syntactic dependencies and semantic relations; markedly better performance is achieved by adding accent ratio and word identity features. Hidden Markov model (HMM) based speech synthesis is then used to generate emphatic speech by placing emphasis on the detected contrastive word pairs. Subjective experiments show that most listeners find emphasis on contrastive word pairs more acceptable than emphasis on non-contrastive word pairs, which indicates the importance of accurate detection of contrastive word pairs.
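The accent ratio feature mentioned above (roughly, how often a word carries a pitch accent in a corpus) can be sketched as follows; the corpus counts and the fallback value for unseen words are illustrative assumptions, not the paper's exact estimator.

```python
from collections import Counter

def accent_ratio(word, accented_counts, total_counts, default=0.5):
    """Fraction of a word's corpus occurrences that carried a pitch accent.

    accented_counts / total_counts are corpus statistics; words unseen in
    the corpus fall back to a neutral default.  (A sketch of the 'accent
    ratio' feature, not the paper's exact formulation.)
    """
    if total_counts.get(word, 0) == 0:
        return default
    return accented_counts.get(word, 0) / total_counts[word]

# Toy corpus statistics (hypothetical numbers).
total = Counter({"not": 100, "the": 200})
accented = Counter({"not": 90, "the": 10})

features = [accent_ratio(w, accented, total) for w in ["not", "the", "unseen"]]
```

Such per-word scalars would be appended to the lexical, syntactic and semantic features before SVM training.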
Title: Detection and emphatic realization of contrastive word pairs for expressive text-to-speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423527
Wei Rao, M. Mak
This paper investigates the small sample-size problem in i-vector based speaker verification systems. The idea of i-vectors is to represent the characteristics of speakers in the factors of a factor analyzer. Because the factor loading matrix defines the possible speaker and channel variability of i-vectors, it is important to suppress the unwanted channel variability. Linear discriminant analysis (LDA), within-class covariance normalization (WCCN), and probabilistic LDA are commonly used for this purpose. These methods, however, require training data comprising many speakers, each providing sufficient recording sessions, to achieve good performance. Performance suffers when the number of speakers and/or the number of sessions per speaker is too small. This paper compares four approaches to addressing this small sample-size problem: (1) preprocessing the i-vectors by PCA before applying LDA (PCA+LDA), (2) replacing the matrix inverse in LDA with the pseudo-inverse, (3) applying multi-way LDA by exploiting the microphone and speaker labels of the training data, and (4) increasing the matrix rank in LDA by generating more i-vectors through utterance partitioning. Results based on the NIST 2010 SRE suggest that utterance partitioning performs best, followed by multi-way LDA and PCA+LDA.
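Approach (4), utterance partitioning, can be sketched as follows: one utterance's frame sequence is split into several partitions, and each partition is later mapped to its own i-vector, multiplying the vectors per speaker and raising the rank of the scatter matrices used by LDA. This is a schematic sketch; real systems typically also randomize the frame order before partitioning.

```python
def partition_utterance(frames, num_parts):
    """Split one utterance's frame sequence into roughly equal partitions.

    Each partition would later yield its own i-vector.  Sketch of the
    utterance-partitioning idea only; frame shuffling is omitted.
    """
    n = len(frames)
    base, extra = divmod(n, num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        parts.append(frames[start:start + size])
        start += size
    return parts

# 10 frames split three ways: every frame lands in exactly one partition.
parts = partition_utterance(list(range(10)), 3)
```

Three partitions from one utterance turn one training i-vector into three, at the cost of estimating each from fewer frames.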
Title: Alleviating the small sample-size problem in i-vector based speaker verification
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423524
Xian-Jun Xia, Zhenhua Ling, Chen-Yu Yang, Lirong Dai
This paper presents an improved unit selection and waveform concatenation speech synthesis method that gathers and utilizes human feedback on synthetic speech. First, a set of texts is synthesized by the baseline unit selection system. Each prosodic word within the synthetic speech is then judged as natural or unnatural by listeners. In our proposed method, the natural synthetic segments are treated as virtual candidate units that extend the original speech corpus for unit selection, and a new speech synthesis system is constructed using this extended corpus. A synthetic error detector based on an SVM classifier is also built from the natural and unnatural synthetic speech. At synthesis time, the input text is synthesized by the baseline system and the extended system simultaneously, and the two unit selection results are evaluated by the trained synthetic error detector to determine the better one. Experimental results prove the effectiveness of the proposed method in improving the naturalness of synthetic speech on a place-name synthesis task.
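The dual-synthesis selection step can be sketched as follows; the synthesis functions and the error detector here are toy stand-ins for the baseline/extended systems and the SVM-based detector described in the abstract.

```python
def select_output(text, systems, error_detector):
    """Synthesize with each system and keep the candidate whose detected
    synthesis-error count is lowest.

    `systems` maps a name to a synthesis function returning a candidate;
    `error_detector` returns the number of segments judged unnatural
    (a stand-in for the SVM-based detector).
    """
    candidates = {name: synth(text) for name, synth in systems.items()}
    best = min(candidates, key=lambda name: error_detector(candidates[name]))
    return best, candidates[best]

# Toy stand-ins: here the "extended" system produces fewer detected errors.
systems = {"baseline": lambda t: t + " [base]", "extended": lambda t: t + " [ext]"}
detector = lambda candidate: 2 if "[base]" in candidate else 1
name, output = select_output("Beijing", systems, detector)
```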
Title: Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423540
T. Zou, Jinsong Zhang, Wen Cao
This paper investigated the Mandarin Tone 2-Tone 3 perceptual space in isolated syllables and disyllables for native speakers and Japanese learners. In two experiments, we examined listeners' use of pitch height and the position of the turning point as cues to tone identity. The results showed that, in isolated syllables, Chinese listeners perceived the two tones in a categorical fashion, with pitch height a more important cue than the turning point; within a certain range of pitch height, the two variables were in a complementary relationship. The perceptual results of the Japanese subjects showed no apparent categorical pattern. In disyllables, for the Chinese subjects, the contextual influence on the boundary position in the Tone 2-half Tone 3 continuum was not significant, but the boundary positions in pitch height and turning point in the Tone 2-Tone 3 continuum shifted significantly across tonal contexts. Compared to the Chinese subjects, the Japanese subjects' perceptual ranges of Tone 3 in isolated syllables and disyllables were narrower, and it was more difficult for them to identify the two tones in disyllables.
Title: A comparative study of perception of tone 2 and tone 3 in Mandarin by native speakers and Japanese learners
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423518
Jinfu Ni, Y. Shiga, H. Kawai, H. Kashioka
In order to build web-based voicefonts, an unsupervised method is needed to automate the extraction of the acoustic and linguistic properties of speech. This paper addresses the impact of automatic speech transcription on statistical parametric speech synthesis based on a single speaker's 100-hour speech corpus, focusing on two factors affecting speech quality: transcript accuracy and the size of the training dataset. Experimental results indicate that for an unsupervised method to achieve fair (MOS 3) voice quality, 1.5 hours of speech are necessary when phone accuracy is above 80%, and 3.5 hours when phone accuracy falls to 65%. The improvement in MOS quality is not significant when more than 4 hours of speech are used. The use of automatic transcripts does lead to voice degradation; one mechanism behind this is that transcript errors cause mismatches between speech segments and phone labels, which significantly distort the structures of the decision trees in the resulting HMM-based voices.
Title: Experiments on unsupervised statistical parametric speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423494
Yan Li, Si Li, Weiran Xu, Jun Guo
The aim of term semantic orientation analysis is to mine the sentiment polarity of words and phrases from their contexts. This paper presents a novel algorithm, called Affinity Propagation, to analyze the semantic orientations of terms. Specifically, we build an informative graph from a text corpus using an efficient Word Activation Force model, regarding each term as a node in the graph. We then propagate opinionated information over the whole graph using only a small number of seed terms. Finally, we use affinity vectors rather than context vectors to detect term polarities and construct polarity lexicons. Evaluations of the proposed algorithm show its advantages over state-of-the-art algorithms, and further improvements can be obtained by combining Affinity Propagation with Pointwise Mutual Information.
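The seed-based propagation idea can be illustrated with a generic label-propagation sketch over a weighted term graph. This stands in for, but is not, the paper's Affinity Propagation over the Word Activation Force graph; the decay factor and the toy graph are invented for illustration.

```python
def propagate_polarity(graph, seeds, iterations=20, decay=0.8):
    """Spread seed polarity scores (+1/-1) over a weighted term graph.

    Each round, a non-seed term's score becomes the decay-weighted
    average of its neighbours' scores; seed terms stay clamped.
    Generic label-propagation sketch, not the paper's exact algorithm.
    """
    scores = {t: seeds.get(t, 0.0) for t in graph}
    for _ in range(iterations):
        new = {}
        for term, nbrs in graph.items():
            if term in seeds:
                new[term] = seeds[term]  # clamp seed polarities
            elif not nbrs:
                new[term] = 0.0
            else:
                total_w = sum(nbrs.values())
                new[term] = decay * sum(w * scores[n] for n, w in nbrs.items()) / total_w
        scores = new
    return scores

# Toy graph: "great" is a positive seed, "awful" a negative seed.
graph = {
    "great": {"good": 1.0},
    "good": {"great": 1.0},
    "awful": {"bad": 1.0},
    "bad": {"awful": 1.0},
}
scores = propagate_polarity(graph, {"great": 1.0, "awful": -1.0})
```

Terms linked to positive seeds end with positive scores and vice versa, from which a polarity lexicon can be thresholded.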
Title: Analyzing semantic orientation of terms using Affinity Propagation
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423487
Guo Li, P. Mok
Previous studies of the interlanguage speech intelligibility benefit (ISIB) have focused on the influence of subjects' native language (L1) on phonetic production and perception in their second language (L2). However, no research so far has examined the effect of listeners' exposure to and training in a second language (L2) on their understanding of L2-accented speech in their native language (L1). This paper addresses this issue with subjects whose L1 is English and whose L2 is Mandarin. Characteristics of Mandarin-accented English include the devoicing of word-final consonants and insufficient distinction of the vowel pairs /i:/ - /i/ and /ε/ - /æ/; these features can negatively affect listeners' understanding of contrastive word pairs. In this study, 9 native Mandarin listeners, 9 monolingual English listeners and 9 English-Mandarin bilinguals were asked to listen to recordings of Mandarin-accented English and identify minimal pairs involving the above consonant and vowel contrasts. Results show that, among the three groups, the native Mandarin listeners scored the highest accuracy, while the English listeners with Mandarin training and the monolingual English listeners had similar scores. These findings support the existence of an ISIB for Mandarin and call for further study of bilingual L2 learners.
Title: Preliminary study on the interlanguage speech intelligibility benefit for English-Mandarin bilingual L2 learners
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423536
Yi-Chin Huang, Chung-Hsien Wu, Sz-Ting Weng
In this paper, a novel hierarchical prosodic unit selection method based on pitch contour pattern retrieval is proposed, in order to obtain natural pitch contours for a personalized synthetic voice. In this framework, a hierarchical prosodic unit based on the Fujisaki model is used to take both the local pitch contour variation and the global intonation of an utterance into account. Furthermore, novel ways of integrating the pitch contour patterns of prosodic units into the prosodic model are introduced to improve the selection of an appropriate pitch contour. A prosodic unit selection method based on sentence retrieval is proposed, which uses not only the traditional linguistic cues as selection criteria but also the shape of the pitch contour. In addition, codewords of the pitch patterns in the training corpus and the synthesized corpus are constructed by the proposed method and used to map the relation between them. Finally, a language model of pitch patterns is adopted to find a proper pitch pattern sequence for the input text. The evaluation results demonstrate that the proposed prosodic model substantially improves the naturalness of the intonation of the synthesized speech compared to a model-based method.
Title: Hierarchical prosodic pattern selection based on Fujisaki model for natural Mandarin speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423510
Cuiling Zhang
Change of pitch is a common type of disguise adopted by criminals; in forensic voice comparison it introduces substantial variance in acoustic properties and degrades speaker recognition performance. This paper investigates the acoustic properties of voices disguised with raised and lowered pitch from 11 Chinese male speakers. Parameters including fundamental frequency, syllable duration, intensity, vowel formant frequencies, and the long-term average spectrum (LTAS) were measured and statistically compared with those of the normal voices. The effect of voice disguise on speaker recognition by both humans and machines is also evaluated. The results show that speakers differ in their ability to adjust pitch. Pitch change leads to corresponding changes in the other parameters and degrades speaker recognition by parameter discrimination, auditory perception and automatic speaker recognition, but some systematic parameter changes provide clues for forensic voice comparison.
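The kind of parameter comparison described (normal-voice vs. disguised-voice values of a parameter such as F0) can be sketched as follows; the F0 values are invented illustrations, not measurements from the paper.

```python
from statistics import mean, stdev

def mean_shift(normal, disguised):
    """Mean change and a simple effect size (shift over the normal-voice
    standard deviation) for one acoustic parameter, e.g. F0 in Hz.

    A schematic comparison only; the paper's statistical tests are
    not reproduced here.
    """
    shift = mean(disguised) - mean(normal)
    effect = shift / stdev(normal)
    return shift, effect

# Hypothetical per-utterance mean F0 values (Hz) for one speaker.
normal_f0 = [118.0, 122.0, 120.0, 121.0, 119.0]
raised_f0 = [168.0, 175.0, 171.0, 170.0, 169.0]
shift, effect = mean_shift(normal_f0, raised_f0)
```

A large, consistent shift in one parameter, paired with systematic changes in others, is the kind of pattern the abstract suggests can serve as a clue in forensic voice comparison.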
Title: Acoustic analysis of disguised voices with raised and lowered pitch