
2012 8th International Symposium on Chinese Spoken Language Processing: Latest Publications

Synthesized stereo-based stochastic mapping with data selection for robust speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423542
Jun Du, Qiang Huo
In this paper, we present a synthesized stereo-based stochastic mapping approach for robust speech recognition. We extend traditional stereo-based stochastic mapping (SSM) in two main aspects. First, the constraint of stereo data, which is not practical in real applications, is relaxed by using HMM-based speech synthesis. Second, we make the feature mapping focus more on incorrectly recognized samples via a data selection strategy. Experimental results on the Aurora3 databases show that our approach achieves consistent and significant improvements in recognition performance in the well-matched (WM) condition across four different European languages.
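The core of a stereo-based stochastic mapping is an MMSE estimate of the clean feature from the noisy one under a joint Gaussian-mixture model. A minimal numpy sketch of that estimator, with toy scalar parameters rather than trained ones (the paper's HMM-synthesis and data-selection extensions are not modeled here):

```python
import numpy as np

# Minimal sketch of stereo-based stochastic mapping: given a joint GMM over
# paired [clean; noisy] features, estimate the clean feature from a noisy one
# via the MMSE (conditional-mean) estimator. All parameters are toy values.

def ssm_mmse(y, weights, mu_x, mu_y, cov_xy, cov_yy):
    """x_hat = sum_k p(k|y) * (mu_x_k + C_xy_k / C_yy_k * (y - mu_y_k))."""
    K = len(weights)
    # component likelihoods of the noisy observation y (scalar for simplicity)
    lik = np.array([weights[k] * np.exp(-0.5 * (y - mu_y[k]) ** 2 / cov_yy[k])
                    / np.sqrt(2 * np.pi * cov_yy[k]) for k in range(K)])
    post = lik / lik.sum()                     # component posteriors p(k|y)
    return sum(post[k] * (mu_x[k] + cov_xy[k] / cov_yy[k] * (y - mu_y[k]))
               for k in range(K))

# toy two-component joint model of a scalar feature
x_hat = ssm_mmse(y=1.0,
                 weights=[0.5, 0.5],
                 mu_x=[0.0, 2.0], mu_y=[0.5, 2.5],
                 cov_xy=[0.8, 0.8], cov_yy=[1.0, 1.0])
print(x_hat)
```

In a real system the same formula runs per frame on cepstral vectors, with matrix-valued covariances.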
Citations: 6
The lossless adaptive arithmetic coding based on context for ITU-T G.719 at variable rate
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423462
Xuan Ji, Jing Wang, Hailong He, Jingming Kuang
This paper presents a novel technique for context-based adaptive arithmetic coding of the quantized MDCT coefficients and frequency band gains in audio compression. A key feature of the new technique is that it combines context models in the time and frequency domains, which are used to model the probabilities of the quantized norms and MDCT coefficients. With this new technique, we achieve a high degree of adaptation and redundancy reduction in the adaptive arithmetic coding. In addition, we employ an efficient variable rate algorithm for G.719, designed on the basis of the baseline entropy coding method of G.719 and the proposed adaptive arithmetic coding technique, respectively. For a set of audio samples used in the application, we achieve an average bit-rate saving of 7.2% while producing audio quality equal to that of the original G.719.
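The gain from context-based adaptive coding comes from the probability model, not the interval arithmetic itself: symbol probabilities are conditioned on a context (here, simply the previous symbol) and the counts adapt as data is seen. A sketch of that modeling half only, with the interval-coding step of a real arithmetic coder omitted:

```python
from collections import defaultdict
import math

# Sketch of a context-based adaptive probability model of the kind an
# arithmetic coder consumes: symbol probabilities are conditioned on the
# previous symbol (the "context") and counts adapt as symbols are seen.

class ContextModel:
    def __init__(self, alphabet):
        self.alphabet = alphabet
        # Laplace-smoothed counts, one table per context
        self.counts = defaultdict(lambda: {s: 1 for s in alphabet})

    def prob(self, ctx, sym):
        c = self.counts[ctx]
        return c[sym] / sum(c.values())

    def update(self, ctx, sym):
        self.counts[ctx][sym] += 1

def code_length_bits(model, seq):
    """Ideal code length -sum log2 p(sym | prev) achieved by adaptive coding."""
    bits, ctx = 0.0, None
    for sym in seq:
        bits += -math.log2(model.prob(ctx, sym))
        model.update(ctx, sym)
        ctx = sym
    return bits

seq = list("abababababab")
bits = code_length_bits(ContextModel("ab"), seq)
print(bits)  # adaptation yields well under 12 bits for 12 binary symbols
```

G.719 conditions on richer contexts (neighboring bands and the previous frame), but the adapt-then-code loop is the same shape.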
Citations: 0
Structured modeling based on generalized variable parameter HMMs and speaker adaptation
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423526
Yang Li, Xunying Liu, Lan Wang
Handling ambient variable acoustic factors in an automatic speech recognition (ASR) system is a challenging task. Ambient variable noise and the distinct acoustic characteristics of individual speakers are two key issues for the recognition task. To address these problems, we present a new framework for robust speech recognition based on structured modeling, using generalized variable parameter HMMs (GVP-HMMs) and unsupervised speaker adaptation (SA) to compensate for the mismatch caused by environment and speaker variability. GVP-HMMs explicitly approximate the continuous trajectories of Gaussian component means, variances, and linear transformation parameters with polynomial functions of the varying noise level. In the recognition stage, an MLLR transform captures the general relationship between the original model set and the current speaker, which helps remove the effects of unwanted speaker factors. The effectiveness of the proposed approach is confirmed by an evaluation experiment on a medium-vocabulary Mandarin recognition task.
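The GVP-HMM idea reduces, per Gaussian parameter, to storing polynomial coefficients and evaluating them at the frame's noise level instead of using a fixed value. A toy sketch with illustrative (not trained) coefficients:

```python
import numpy as np

# Sketch of the GVP-HMM idea: a Gaussian component's mean is not a constant
# but a polynomial function of an auxiliary variable, here the frame's SNR.
# Coefficients below are illustrative stand-ins for trained values.

def gvp_mean(coeffs, snr_db):
    """mu(v) = c0 + c1*v + c2*v^2 + ... evaluated at the current noise level."""
    return np.polyval(coeffs[::-1], snr_db)  # np.polyval wants highest power first

coeffs = [1.0, 0.05, -0.001]   # c0, c1, c2
print(gvp_mean(coeffs, 20.0))  # mean under a 20 dB SNR frame
print(gvp_mean(coeffs, 5.0))   # mean shifts as conditions get noisier
```

Variances and MLLR-style transform parameters get the same polynomial treatment in the paper's framework.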
Citations: 7
An improved tone labeling and prediction method with non-uniform segmentation of F0 contour
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423467
Xingyu Na, Xiang Xie, Jingming Kuang, Yaling He
This paper proposes a tone labeling technique for tonal language speech synthesis. Non-uniform segmentation using Viterbi alignment is introduced to determine the boundaries from which F0 symbols are obtained; these symbols are used as tonal labels to eliminate the mismatch between tone patterns and F0 contours in the training data. During context clustering, the tendency of adjacent F0 state distributions is captured by state-based phonetic trees. The means of the tone model states are directly quantized to obtain full tonal labels in the synthesis stage. Both objective and subjective experiment results show that the proposed technique can improve the perceptual prosody of synthetic speech from non-professional speakers.
Citations: 0
Context dependant phone mapping for cross-lingual acoustic modeling
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423496
Van Hai Do, Xiong Xiao, Chng Eng Siong, Haizhou Li
This paper presents a novel method for acoustic modeling with limited training data. The idea is to leverage a well-trained acoustic model of a source language. In this paper, a conventional HMM/GMM triphone acoustic model of the source language is used to derive likelihood scores for each feature vector of the target language. These scores are then mapped to triphones of the target language using neural networks. We conduct a case study in which Malay is the source language and English (the Aurora-4 task) is the target language. Experimental results on the Aurora-4 clean test set show that using only 7, 16, and 55 minutes of English training data, we achieve word error rates of 21.58%, 17.97%, and 12.93%, respectively. These results significantly outperform conventional HMM/GMM and hybrid systems.
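The mapping step takes a per-frame vector of source-language phone scores and turns it into a distribution over target-language phones. A minimal sketch with a single softmax layer and random stand-in weights (the paper uses trained neural networks and context-dependent triphone targets):

```python
import numpy as np

# Sketch of the cross-lingual mapping step: a source-language acoustic model
# yields a vector of per-phone likelihood scores for each frame, and a small
# network maps that vector to posteriors over target-language phones.
# Weights are random stand-ins for a trained mapping.

rng = np.random.default_rng(0)
n_source_phones, n_target_phones = 40, 30
W = rng.normal(size=(n_target_phones, n_source_phones))
b = np.zeros(n_target_phones)

def map_scores(source_scores):
    z = W @ source_scores + b
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

frame_scores = rng.normal(size=n_source_phones)   # stand-in likelihood scores
post = map_scores(frame_scores)
print(post.sum())  # posteriors over target phones sum to 1
```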
Citations: 10
Text-Dependent Speaker Recognition with long-term features based on functional data analysis
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423461
Chenhao Zhang, T. Zheng, Ruxin Chen
Text-Dependent Speaker Recognition (TDSR) is widely used nowadays. Short-term features such as Mel-Frequency Cepstral Coefficients (MFCCs) have been the dominant features in traditional Dynamic Time Warping (DTW) based TDSR systems. Short-term features capture the local portion of significant temporal dynamics well but represent the overall statistical characteristics of a sentence poorly. Functional Data Analysis (FDA) has been proven to show a significant advantage in exploring the statistical information of data, so in this paper a long-term feature extraction based on MFCC and FDA theory is proposed, where the extraction procedure consists of the following steps. First, FDA theory is applied after MFCC feature extraction. Second, to compress redundant data information, a new feature based on Functional Principal Component Analysis (FPCA) is generated. Third, the distance between training features and test features is calculated for use in the recognition procedure. Compared with the existing MFCC-plus-DTW method, experimental results show that the new features extracted with the proposed method, combined with the cosine similarity measure, demonstrate better performance.
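The matching step above replaces DTW frame alignment with a single cosine similarity between fixed-length utterance-level vectors. A minimal sketch, with short vectors standing in for FPCA-compressed MFCC trajectories:

```python
import numpy as np

# Sketch of the matching step: long-term features for the enrollment and test
# utterances are compared with cosine similarity instead of DTW alignment.
# The vectors are stand-ins for FPCA-compressed MFCC trajectories.

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

train_feat = np.array([0.9, 0.1, -0.3, 0.5])
test_feat  = np.array([0.8, 0.2, -0.2, 0.6])
print(cosine_similarity(train_feat, test_feat))  # close to 1.0 for a match
```

Because both utterances are reduced to one vector each, the comparison is O(d) rather than the O(n*m) of DTW.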
Citations: 0
An improved steady segment based decoding algorithm by using response probability for LVCSR
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423525
Zhanlei Yang, Wenju Liu, Hao Chao
This paper proposes a novel decoding algorithm that integrates both steady speech segments and the location information of observations into the conventional path extension framework. First, speech segments with a stable spectrum are extracted. Second, a preliminary improved algorithm is obtained by modifying the traditional inter-HMM extension framework using the detected steady segments. Then, at the probability calculation stage, the response probability (RP), which represents the location information of observations within the acoustic feature space, is further incorporated into decoding. Thus, the RP directs the decoder to enhance or weaken path candidates that pass through the front-end steady-segment-based decoding. Experiments on Mandarin speech recognition show that the character error rate of the proposed algorithm achieves a 4.6% relative reduction compared with a system using only steady segments, and the run-time factor achieves a 10.0% relative reduction compared with a system using only RP.
Citations: 1
A simple and effective pitch re-estimation method for rich prosody and speaking styles in HMM-based speech synthesis
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423473
Chengyuan Lin, Chien-Hung Huang, C. Kuo
This paper proposes a novel method of controllable pitch re-estimation that can produce better pitch contours or provide diverse speaking styles for text-to-speech (TTS) systems. The method is composed of a pitch re-estimation model and a set of control parameters. The pitch re-estimation model is employed to reduce the over-smoothing effects usually introduced by TTS training. The control parameters are designed to generate not only rich intonation but also speaking styles, e.g. a foreign accent or an excited tone. To verify the feasibility of the proposed method, we conducted experiments with both objective measures and subjective tests. Although the re-estimated pitch yields only slightly lower prediction error on the objective measure, it produces clearly better intonation in the listening test. Moreover, expressive speech can be generated successfully under the framework of controllable pitch re-estimation.
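One simple way a control parameter can counter over-smoothing is to re-scale the predicted log-F0 contour around its mean, widening pitch excursions without moving the average pitch. A hedged sketch of that idea, not the paper's exact re-estimation model; contour values and parameters are illustrative:

```python
import numpy as np

# Sketch of controllable pitch re-estimation: an over-smoothed log-F0 contour
# is re-scaled around its mean by control parameters, expanding the pitch
# range (richer intonation) or shifting it (style change). Illustrative only.

def reestimate_pitch(log_f0, dynamic_scale=1.5, shift=0.0):
    mean = log_f0.mean()
    return mean + dynamic_scale * (log_f0 - mean) + shift

contour = np.log(np.array([180.0, 200.0, 220.0, 210.0, 190.0]))  # F0 in Hz
rich = reestimate_pitch(contour, dynamic_scale=1.5)
print(np.exp(rich))  # larger pitch excursions around the same mean log-F0
```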
Citations: 0
mENUNCIATE: Development of a computer-aided pronunciation training system on a cross-platform framework for mobile, speech-enabled application development
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423507
Pengfei Liu, K. Yuen, Wai-Kim Leung, H. Meng
This paper presents our ongoing research in the field of speech-enabled, multimodal, mobile application development. We have developed a multimodal framework that enables cross-platform development using open-standards-based HTML, CSS and JavaScript. The framework achieves high extensibility through a plugin-based architecture and provides scalable REST-based speech services in the cloud to support large numbers of requests from mobile devices. This paper describes the architecture and implementation of the framework and, based on it, the development of a mobile computer-aided pronunciation training application for Chinese learners of English, named mENUNCIATE. We also report a preliminary performance evaluation of mENUNCIATE.
Citations: 8
Enhanced lengthening cancellation using bidirectional pitch similarity alignment for spontaneous speech
Pub Date : 2012-12-01 DOI: 10.1109/ISCSLP.2012.6423517
Po-Yi Shih, Bo-Wei Chen, Jhing-Fa Wang, Jhing-Wei Wu
In this work, an enhanced lengthening cancellation method is proposed to detect and cancel the lengthened part of vowels. The proposed method consists of an autocorrelation function, cosine similarity-based lengthening detection, and bidirectional pitch contour alignment. The autocorrelation function is used to obtain the reference pitch contour. The cosine similarity-based method is applied to measure the similarity between the reference and the next adjacent pitch contours. Because the lengths of periodic segments vary, fixed-size frames may cause accumulative errors; therefore, bidirectional pitch contour alignment is adopted in this study. Experiments indicate that the proposed method achieves accuracy rates of 91.4% and 88.7% on a 60-keyword and a 50-sentence database, respectively. Moreover, the proposed approach runs about three times faster than the baseline. These results prove the effectiveness of the proposed method.
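The first stage above, obtaining a pitch estimate via the autocorrelation function, can be sketched in a few lines: the lag of the first strong autocorrelation peak is the pitch period. The sample rate, search range, and test signal below are illustrative choices, not values from the paper:

```python
import numpy as np

# Sketch of autocorrelation-based pitch estimation: the lag at which a
# frame's autocorrelation peaks (excluding lag 0) is the pitch period.
# A pure 200 Hz tone at 8 kHz sampling should give a lag of 40 samples.

def acf_pitch_period(frame, min_lag=20, max_lag=200):
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    return min_lag + int(np.argmax(acf[min_lag:max_lag]))

sr, f0 = 8000, 200.0
t = np.arange(800) / sr
frame = np.sin(2 * np.pi * f0 * t)
period = acf_pitch_period(frame)
print(period, sr / period)  # period in samples and the implied F0 in Hz
```

Running this per frame yields the pitch contour that the cosine-similarity comparison then operates on.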
Citations: 1