2012 8th International Symposium on Chinese Spoken Language Processing: Latest Papers

Pitch accent detection and prediction with DCT features and CRF model
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423504
Wenping Hu, Yao Qian, F. Soong
Automatic detection and prediction of pitch accent, which determines whether a word contains a prominent syllable and which pitch accent pattern it carries, is crucial for expressive Text-To-Speech (TTS) synthesis. Training a model to detect and predict pitch accent usually requires a large amount of training data manually annotated by phonetically trained language experts, which is both time-consuming and costly. In this paper, we propose a semi-automatic algorithm for pitch accent modeling, in which the existence of accentuation in the training data is labeled at the word level by native speakers (i.e., not phonetically trained language experts) and the type of each pitch accent is automatically detected from its vector-quantized DCT coefficient patterns. A cascaded, two-stage approach, which separates predicting the existence of a pitch accent from determining its type, is proposed to process unrestricted text input with models trained as Conditional Random Fields (CRFs). The evaluation results show that the new approach outperforms the conventional single-stage approach.
Citations: 0
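The abstract above mentions describing a pitch accent by vector-quantized DCT coefficients of its pitch contour. A minimal sketch of that general idea follows, assuming a toy log-F0 contour per word, an orthonormal DCT-II, and a hand-made three-entry codebook; the function names, the codebook values, and the labels are hypothetical and are not taken from the paper.

```python
import math


def dct_ii(values, num_coeffs):
    """Return the first num_coeffs orthonormal DCT-II coefficients of values."""
    n = len(values)
    coeffs = []
    for k in range(num_coeffs):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def quantize(vector, codebook):
    """Vector quantization: index of the codeword closest to vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(vector, codebook[j]))


def accent_shape(log_f0, codebook, labels):
    """Label a contour by quantizing its DCT shape coefficients (c1, c2)."""
    shape = dct_ii(log_f0, 3)[1:]  # drop c0, which only encodes the mean level
    return labels[quantize(shape, codebook)]


# Hypothetical codebook over (c1, c2); a real one would be learned from data.
CODEBOOK = [[-0.3, 0.0], [0.3, 0.0], [0.0, 0.0]]
LABELS = ["rise", "fall", "flat"]

rising = [4.6 + 0.05 * i for i in range(8)]   # toy log-F0 values
falling = [5.0 - 0.05 * i for i in range(8)]
flat = [4.8] * 8

for contour in (rising, falling, flat):
    print(accent_shape(contour, CODEBOOK, LABELS))
```

Dropping the zeroth coefficient makes the label depend on contour shape rather than on the speaker's overall pitch level, which is one common reason for using DCT coefficients as a compact contour description.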
Hierarchical clustering and robust identification for block-based autoregressive speech parameter estimation
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423482
Ruofei Chen, C. Chan
Given accurate system parameters such as the state transition matrix F and the corruption mapping matrix H, clean speech autoregressive (AR) parameters can be effectively estimated from a series of noisy observations with Kalman filtering. In this paper, we address several fundamental issues to improve AR parameter estimation based on a linear dynamical system (LDS). A hierarchical time series clustering scheme is devised to group speech blocks with similar trajectories and corruption types. In addition, a correlated robust identification scheme using an a posteriori signal-to-noise ratio (SNR) mask is proposed to improve the identification accuracy. The effectiveness of the proposed clustering and identification scheme is evaluated in terms of spectral distortion between the Kalman estimates and the true clean speech parameters. Significant improvement is observed over the original matrix quantization (MQ) based approach. The proposed scheme is also successfully applied in a model-based speech enhancement application, and is expected to be effective for robust identification in various codebook-driven speech applications.
Citations: 1
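The abstract above relies on Kalman filtering with a state transition F and an observation (corruption) mapping H. As a rough illustration of the predict and update steps only, here is a scalar sketch; treating F and H as scalars, the noise variances Q and R, and the toy observations are all assumptions made for this example, and the paper's block-based AR setup is not reproduced.

```python
def kalman_filter_1d(observations, F, H, Q, R, x0, P0):
    """Scalar Kalman filter: return the filtered state estimate per observation.

    F: state transition, H: observation mapping,
    Q: process noise variance, R: observation noise variance,
    x0, P0: initial state estimate and its variance.
    """
    x, P = x0, P0
    estimates = []
    for z in observations:
        # Predict the next state and its uncertainty.
        x_pred = F * x
        P_pred = F * P * F + Q
        # Update with the new observation via the Kalman gain.
        K = P_pred * H / (H * P_pred * H + R)
        x = x_pred + K * (z - H * x_pred)
        P = (1.0 - K * H) * P_pred
        estimates.append(x)
    return estimates


# Toy data: a constant true value of 1.0 observed with alternating +/-0.2 error.
observations = [1.2, 0.8] * 10
estimates = kalman_filter_1d(observations, F=1.0, H=1.0,
                             Q=1e-4, R=0.04, x0=0.0, P0=1.0)
print(estimates[-1])
```

In the matrix case the same two steps apply with F and H as matrices and the division replaced by a matrix inverse, which is why accurate F and H matter for the quality of the estimates.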
Automatic pitch accent detection using auto-context with acoustic features
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423523
Junhong Zhao, Weiqiang Zhang, Hua Yuan, Jia Liu, Shanhong Xia
In the field of prosodic event detection, many local acoustic features have been proposed to represent the prosodic characteristics of a speech unit. However, the context information that captures possible regularities among neighboring prosodic events has not been used effectively. The main difficulty in exploiting prosodic context is capturing long-distance sequential dependencies. To solve this problem, we introduce a new learning approach: auto-context. In this algorithm, a classifier is first trained on local acoustic features; the discriminative probabilities it produces are selected as context information for the next iteration. A new classifier is then trained on the selected context information together with the local acoustic features. By repeatedly using the updated probabilities as context information for the next iteration, the algorithm improves its recognition ability until it converges. The merit of this method is that it selects context information flexibly, retaining reliable context information and discarding unreliable information. The experimental results show that the proposed method improves pitch accent detection accuracy by about 1% absolute.
Citations: 1
Effective sentence selection based on phone/model coverage maximization for speaker adaptation in HMM-based speech synthesis
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423469
C. Lin, Po Kai Huang, Chengyuan Lin, C. Kuo
Reducing the recording effort required in practical speaker adaptive text-to-speech applications would be very useful. In this paper, we present two sentence selection approaches based on a greedy algorithm; one is based on phone coverage and the other is based on model coverage. The former considers the phonetic information in speaker adaptation data, while the latter focuses on occurrences of Mel-cepstral and logF0 models in decision trees of the average voice model. To verify the efficacy of the proposed methods, we compare their performance with that of a random selection method in objective and subjective evaluations. The objective and subjective evaluation results demonstrate that both methods outperform the random selection method.
Citations: 0
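The abstract above describes greedy sentence selection driven by phone coverage. One simple reading of that idea is greedy set cover: repeatedly pick the sentence that adds the most phones not yet covered. The sketch below assumes that reading; the scoring rule (count of new unique phones), the `budget` parameter, and the toy phone inventories are hypothetical and may differ from the criteria used in the paper.

```python
def greedy_select(sentences, budget):
    """Greedily pick up to `budget` sentences that maximize phone coverage.

    sentences: dict mapping a sentence id to its list of phone labels.
    Returns (chosen ids in pick order, set of covered phones).
    """
    covered = set()
    chosen = []
    remaining = dict(sentences)
    while remaining and len(chosen) < budget:
        # Sorting the ids makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda sid: len(set(remaining[sid]) - covered))
        gain = set(remaining[best]) - covered
        if not gain:
            break  # no remaining sentence adds a new phone
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy corpus with made-up phone labels.
corpus = {
    "s1": ["a", "b", "c"],
    "s2": ["c", "d"],
    "s3": ["a", "b"],
    "s4": ["d", "e", "f", "g"],
}
chosen, covered = greedy_select(corpus, budget=2)
print(chosen, sorted(covered))
```

A model-coverage variant could reuse the same loop by replacing phone labels with identifiers of decision-tree leaves visited by each sentence, although how the paper defines those units is not stated in the abstract.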
Diachronic contrastive analysis on read speech in broadcast news: Evidence from pitch and duration
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423498
Yu Zou, Yan Wang, W. He
Which diachronic phonetic changes have taken place in Mandarin Chinese over the past 100 years? This paper analyzes and compares the pitch and duration of read speech in broadcast news from a diachronic perspective. The results show that the pitch peaks are highest and the pitch range is widest in the 1970s, when raised pitch valleys are also especially frequent; the 1950s-60s rank second; and across the 1980s, 1990s and 2000s, the pitch peaks gradually drift down and the pitch range narrows. The average word rate is slowest and syllable duration is longest in the 1970s; the 1950s-60s again rank second; in the other three periods, the word rate increases and syllable duration shortens. Furthermore, the prosodic features are not determined by the text in the different historical periods.
Citations: 2
Phrase-based data selection for language model adaptation in spoken language translation
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423483
Shixiang Lu, Wei Wei, Xiaoyin Fu, Lichun Fan, Bo Xu
In this paper, we propose an unsupervised phrase-based data selection model that addresses the problem of selecting non-domain-specific language model (LM) training data to build an adapted LM. In a spoken language translation (SLT) system, we aim to find the LM training sentences that are similar to the translation task. Compared with traditional bag-of-words models, the phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole rather than the selection of single words in isolation. Large-scale experimental results demonstrate that our approach significantly outperforms state-of-the-art approaches in both LM perplexity and translation performance.
Citations: 3
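The abstract above contrasts phrase-level selection with bag-of-words selection. The toy sketch below illustrates why the distinction matters, using plain n-gram overlap as a stand-in for "phrases"; this overlap score, the function names, and the example sentences are assumptions for illustration and are not the selection model proposed in the paper.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_score(candidate, in_domain_sentences, n):
    """Fraction of the candidate's n-grams that also occur in the in-domain text."""
    domain = set()
    for sentence in in_domain_sentences:
        domain.update(ngrams(sentence.split(), n))
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    return sum(gram in domain for gram in cand) / len(cand)


in_domain = ["where is the train station", "how much is the ticket"]
scrambled = "the station is where much ticket train"
fluent = "where is the ticket"

# With n=1 (bag of words) both candidates look fully in-domain;
# with n=2 only the fluent sentence shares word sequences with the task.
for n in (1, 2):
    print(n, overlap_score(scrambled, in_domain, n),
          overlap_score(fluent, in_domain, n))
```

The scrambled sentence reuses only in-domain words, so a word-level score cannot separate it from the fluent one, while the bigram score can.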
An analysis of vector Taylor series model compensation for non-stationary noise in speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423503
Duc Hoang Ha Nguyen, Xiong Xiao, Chng Eng Siong, Haizhou Li
In this paper, we investigate a feature conditioning method for vector Taylor series (VTS) based model compensation. VTS is a technique that predicts the noisy acoustic model from a clean acoustic model and a noise model. Most previous studies use a single Gaussian noise model, which cannot model noise statistics well, especially in non-stationary noisy environments. We propose combining feature processing with VTS model compensation to handle non-stationary noise more efficiently. In the feature processing stage, the non-stationary characteristics of the noise are reduced, so the processed features are more suitable for VTS model compensation with a single Gaussian noise model. Experimental analysis on the AURORA2 task shows that the proposed method has the potential to improve the performance of the VTS method in non-stationary environments if good noise estimation is available.
Citations: 3
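The abstract above describes VTS as predicting a noisy acoustic model from a clean model and a noise model. A minimal scalar sketch of the usual first-order idea follows, assuming a single log-spectral dimension, additive noise only (no channel term, no cepstral transform), and the mismatch function y = x + log(1 + exp(n - x)); the function name and the numbers are hypothetical, and the paper's actual setup is not reproduced.

```python
import math


def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of one Gaussian in one log-spectral bin.

    Clean speech x ~ N(mu_x, var_x) and noise n ~ N(mu_n, var_n) are combined
    through y = x + log(1 + exp(n - x)), linearized around (mu_x, mu_n).
    Returns the approximate mean and variance of the noisy feature y.
    """
    # Partial derivative dy/dx at the expansion point; dy/dn equals 1 - g.
    g = 1.0 / (1.0 + math.exp(mu_n - mu_x))
    mu_y = mu_x + math.log1p(math.exp(mu_n - mu_x))
    var_y = g * g * var_x + (1.0 - g) ** 2 * var_n
    return mu_y, var_y


# High SNR: the noisy model stays close to the clean model.
print(vts_compensate(mu_x=10.0, var_x=1.0, mu_n=-10.0, var_n=4.0))
# Equal speech and noise levels: both contribute to the noisy model.
print(vts_compensate(mu_x=5.0, var_x=1.0, mu_n=5.0, var_n=4.0))
```

Because the noise enters through a single mean and variance, a noise source whose level changes quickly is poorly described by this form, which is consistent with the abstract's motivation for reducing non-stationarity in the features first.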
Effects of carriers on Mandarin tone categorical perception
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423505
Dazuo Wang, Xiuxiu Wang, Gang Peng
This study investigated the effects of three different carriers on Mandarin tone perception. Three tone continua were constructed: modified speech, synthesized speech, and nonspeech. Identification tests were conducted for the two speech continua, while discrimination tests were conducted for all three continua. Results showed that the category boundary position differed significantly between the modified speech and synthesized speech continua. The boundary position of the modified speech tone continuum was closer to the rising end than that of the synthesized speech tone continuum, suggesting that greater complexity reduces overall pitch sensitivity. In the discrimination test, subjects generally exhibited the same pattern for the three continua, but with slightly lower discrimination accuracy for the nonspeech continuum, suggesting that the effects of long-term tone language experience with Mandarin carry over to the nonspeech domain.
Citations: 1
Improve mispronunciation detection with Tandem feature
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423538
Hua Yuan, Junhong Zhao, Jia Liu
This paper presents a method to improve mispronunciation detection performance for a low-resource acoustic model. One hour of speech data is randomly selected from CU-CHLOE to simulate a low-resource non-native English setting. A Tandem feature derived from an articulatory-based Multi-Layer Perceptron (MLP) is employed to replace the traditional spectral feature (e.g., PLP). Further, motivated by the similar pronunciation characteristics of Mandarin and English spoken by Chinese speakers, Mandarin speech data is used to assist in training the multilingual articulatory MLPs. The Tandem feature is also combined with PLP to improve performance. Finally, the proposed method improves phone recognition correctness (CORR) by 3.84% and diagnosis accuracy (DA) by 2.25%.
Citations: 4
Voice conversion using Bayesian mixture of Probabilistic Linear Regressions and dynamic kernel features
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423521
Na Li, Y. Qiao
Voice conversion can be formulated as finding a mapping function that transforms the features of a source speaker to those of a target speaker. Gaussian mixture model (GMM)-based conversion techniques [1, 2] have been widely used in voice conversion because of their effectiveness and efficiency. In recent work [3], we generalized GMM-based mapping to a Mixture of Probabilistic Linear Regressions (MPLR). However, both GMM-based mapping and MPLR are subject to overfitting, especially when the training utterances are sparse, and both ignore the inherent time dependency among speech features. This paper addresses these problems by introducing dynamic kernel features and conducting Bayesian analysis for MPLR. The dynamic kernel features are calculated as kernel transformations of the current, previous and next frames, which can model both the nonlinearities and the dynamics in the features. We further develop Maximum a Posteriori (MAP) inference to alleviate the overfitting problem by introducing a prior on the parameters of the kernel transformation. Our experimental results show that the proposed methods achieve better performance than the MPLR-based model.
Citations: 0