Synthesized stereo-based stochastic mapping with data selection for robust speech recognition
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423542
Jun Du, Qiang Huo
In this paper, we present a synthesized stereo-based stochastic mapping approach for robust speech recognition. We extend traditional stereo-based stochastic mapping (SSM) in two main aspects. First, the constraint of stereo data, which is impractical in real applications, is relaxed by using HMM-based speech synthesis. Second, we focus the feature mapping on incorrectly recognized samples via a data selection strategy. Experimental results on the Aurora3 databases show that our approach achieves consistent, significant improvements in recognition performance in the well-matched (WM) condition across four different European languages.
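To make the mapping step concrete, below is a minimal sketch of an SSM back-end under the usual joint-GMM formulation: a GMM is trained on stacked noisy/clean vectors and clean features are recovered by MMSE regression. The toy data, dimensions, and mixture count are assumptions for illustration; in the paper the "clean" channel would come from HMM-based synthesis rather than recordings.

```python
# Minimal SSM sketch, assuming a joint GMM over stacked [noisy; clean] vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

D, K, N = 13, 4, 2000                      # feature dim, mixtures, frames
rng = np.random.default_rng(0)
clean = rng.normal(size=(N, D))
noisy = clean + 0.5 * rng.normal(size=(N, D))   # toy "stereo" pair

gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
gmm.fit(np.hstack([noisy, clean]))         # joint vector z = [x; y]

def ssm_map(x):
    """MMSE estimate of the clean vector y given a noisy vector x."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    S_xx = gmm.covariances_[:, :D, :D]
    S_yx = gmm.covariances_[:, D:, :D]
    # posterior p(k | x) under the marginal GMM on x
    log_p = np.array([
        -0.5 * (x - mu_x[k]) @ np.linalg.solve(S_xx[k], x - mu_x[k])
        - 0.5 * np.linalg.slogdet(2 * np.pi * S_xx[k])[1]
        for k in range(K)
    ]) + np.log(gmm.weights_)
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # component-wise conditional means, weighted by the posterior
    y_k = np.stack([
        mu_y[k] + S_yx[k] @ np.linalg.solve(S_xx[k], x - mu_x[k])
        for k in range(K)
    ])
    return post @ y_k

print(ssm_map(noisy[0]).shape)             # (13,)
```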
{"title":"Synthesized stereo-based stochastic mapping with data selection for robust speech recognition","authors":"Jun Du, Qiang Huo","doi":"10.1109/ISCSLP.2012.6423542","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423542","url":null,"abstract":"In this paper, we present a synthesized stereo-based stochastic mapping approach for robust speech recognition. We extend the traditional stereo-based stochastic mapping (SSM) in two main aspects. First, the constraint of stereo-data, which is not practical in real applications, is relaxed by using HMM-based speech synthesis. Then we make feature mapping more focused on those incorrectly recognized samples via a data selection strategy. Experimental results on Aurora3 databases show that our approach can achieve consistently significant improvements of recognition performance in the well-matched (WM) condition among four different European languages.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134163141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The lossless adaptive arithmetic coding based on context for ITU-T G.719 at variable rate
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423462
Xuan Ji, Jing Wang, Hailong He, Jingming Kuang
This paper presents a novel technique for context-based adaptive arithmetic coding of the quantized MDCT coefficients and frequency band gains in audio compression. A key feature of the new technique is that it combines context models in the time and frequency domains, which are used to estimate the probabilities of the quantized norms and MDCT coefficients. With this new technique, we achieve a high degree of adaptation and redundancy reduction in the adaptive arithmetic coding. In addition, we design an efficient variable-rate algorithm for G.719, built on both the baseline entropy coding method of G.719 and the proposed adaptive arithmetic coding technique. For a set of audio samples used in the application, we achieve an average bit-rate saving of 7.2% while producing audio quality equal to that of the original G.719.
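The sketch below illustrates the adaptive, context-conditioned probability model that an arithmetic coder of this kind would drive; it reports the ideal code length (-log2 p per symbol) rather than implementing the full coder. The context definition (previous symbol in time, previous symbol in frequency) and the alphabet size are assumptions, not G.719's actual model.

```python
# Context-adaptive symbol probabilities with online count updates.
import math
from collections import defaultdict

ALPHABET = 16                                 # quantized-coefficient symbols

class ContextModel:
    def __init__(self):
        # one adaptive count table per context, Laplace-initialised
        self.counts = defaultdict(lambda: [1] * ALPHABET)

    def prob(self, ctx, sym):
        c = self.counts[ctx]
        return c[sym] / sum(c)

    def update(self, ctx, sym):
        self.counts[ctx][sym] += 1

def code_length(frames, model):
    """Ideal adaptive code length (bits) for a 2-D array of symbols."""
    bits = 0.0
    for t, frame in enumerate(frames):
        for f, sym in enumerate(frame):
            prev_t = frames[t - 1][f] if t > 0 else 0   # time context
            prev_f = frame[f - 1] if f > 0 else 0       # frequency context
            ctx = (prev_t, prev_f)
            bits += -math.log2(model.prob(ctx, sym))
            model.update(ctx, sym)
    return bits

frames = [[(t + f) % ALPHABET for f in range(8)] for t in range(100)]
print(round(code_length(frames, ContextModel()), 1), "bits")
```

As the counts adapt, predictable symbol patterns cost progressively fewer bits, which is the redundancy reduction the paper exploits.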
{"title":"The lossless adaptive arithmetic coding based on context for ITU-T G.719 at variable rate","authors":"Xuan Ji, Jing Wang, Hailong He, Jingming Kuang","doi":"10.1109/ISCSLP.2012.6423462","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423462","url":null,"abstract":"This paper presents a novel technique of context-based adaptive arithmetic coding of the quantized MDCT coefficients and frequency band gains in audio compression. A key feature of the new technique is combining the context model in time domain and frequency domain, which used for the quantized norms' and MDCT coefficients' probability. With this new technique, we achieve a high degree of adaptation and redundancy reduction in the adaptive arithmetic coding. In addition, we employ an efficient variable rate algorithm for G.719. The variable rate algorithm is designed based on the baseline entropy coding method of G.719 and the proposed adaptive arithmetic coding technique respectively. For a set of audio samples used in the application, we achieve an average bit-rate saving of 7.2% while producing audio quality equal to that of the original G.719.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134400787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured modeling based on generalized variable parameter HMMs and speaker adaptation
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423526
Yang Li, Xunying Liu, Lan Wang
Handling variable ambient acoustic factors is a challenging task for automatic speech recognition (ASR) systems. Variable ambient noise and the distinct acoustic characteristics of individual speakers are two key issues for the recognition task. To address these problems, we present a new framework for robust speech recognition based on structured modeling, using generalized variable parameter HMMs (GVP-HMMs) and unsupervised speaker adaptation (SA) to compensate for the mismatch caused by environment and speaker variability. GVP-HMMs explicitly approximate the continuous trajectories of Gaussian component means, variances, and linear transformation parameters as polynomial functions of the varying noise level. At the recognition stage, an MLLR transform captures the general relationship between the original model set and the current speaker, which helps remove the effects of unwanted speaker factors. The effectiveness of the proposed approach is confirmed by evaluation experiments on a medium-vocabulary Mandarin recognition task.
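The core GVP-HMM idea, a model parameter expressed as a polynomial of the noise level, reduces to a short computation. The sketch below fits such a trajectory for one Gaussian mean dimension and instantiates it at an unseen noise level; the polynomial order and toy values are assumptions.

```python
# One Gaussian mean dimension as a polynomial function of SNR (dB).
import numpy as np

snr_points = np.array([0.0, 5.0, 10.0, 15.0, 20.0])    # training conditions
# mean estimated at each training condition (toy values)
mean_at_snr = np.array([-2.1, -1.0, -0.2, 0.3, 0.5])

coeffs = np.polyfit(snr_points, mean_at_snr, 2)        # order-2 trajectory

def gvp_mean(snr_db):
    """Instantiate the component mean for an observed noise level."""
    return np.polyval(coeffs, snr_db)

print(gvp_mean(7.5))   # mean interpolated for an unseen 7.5 dB condition
```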
{"title":"Structured modeling based on generalized variable parameter HMMs and speaker adaptation","authors":"Yang Li, Xunying Liu, Lan Wang","doi":"10.1109/ISCSLP.2012.6423526","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423526","url":null,"abstract":"It is a challenging task that to handle ambient variable acoustic factors in automatic speech recognition (ASR) system. The ambient variable noise and the distinct acoustic factors among speakers are two key issues for recognition task. To solve these problems, we present a new framework for robust speech recognition based on structured modeling, using generalized variable parameter HMMs (GVP-HMMs) and unsupervised speaker adaptation (SA) to compensate the mismatch from environment and speaker variability. GVP-HMMs can explicitly approximate the continuous trajectory of Gaussian component mean, variance and linear transformation parameter with a polynomial function against the varying noise level. In recognition stage, MLLR transform captures general relationship between the original model set and the current speaker, which could help in removing the effects of unwanted speaker factors. The effectiveness of the proposed approach is confirmed by evaluation experiment on a medium vocabulary Mandarin recognition task.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116544351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An improved tone labeling and prediction method with non-uniform segmentation of F0 contour
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423467
Xingyu Na, Xiang Xie, Jingming Kuang, Yaling He
This paper proposes a tone labeling technique for tonal-language speech synthesis. Non-uniform segmentation using Viterbi alignment is introduced to determine the boundaries from which F0 symbols are derived; these symbols are used as tonal labels to eliminate the mismatch between tone patterns and the F0 contours of the training data. During context clustering, the tendencies of adjacent F0 state distributions are captured by state-based phonetic trees. In the synthesis stage, the means of the tone model states are directly quantized to obtain full tonal labels. Both objective and subjective experimental results show that the proposed technique can improve the perceptual prosody of synthetic speech from non-professional speakers.
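A minimal sketch of the symbol-derivation step follows: segment-mean log-F0 values are quantized into a small set of levels to serve as tonal labels. In the paper the boundaries come from Viterbi alignment; here they are given directly, and the number of quantization levels is our assumption.

```python
# Quantizing segment-mean log-F0 into discrete tonal-label symbols.
import numpy as np

f0 = np.concatenate([np.linspace(120, 180, 30),     # rising segment
                     np.linspace(180, 140, 20)])    # falling segment
boundaries = [(0, 30), (30, 50)]                    # from forced alignment

levels = np.linspace(np.log(80), np.log(300), 5)    # 5 quantized log-F0 bins

def f0_symbols(f0, boundaries):
    syms = []
    for s, e in boundaries:
        m = np.log(f0[s:e]).mean()                  # segment mean log-F0
        syms.append(int(np.argmin(np.abs(levels - m))))
    return syms

print(f0_symbols(f0, boundaries))                   # e.g. [2, 2]
```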
{"title":"An improved tone labeling and prediction method with non-uniform segmentation of F0 contour","authors":"Xingyu Na, Xiang Xie, Jingming Kuang, Yaling He","doi":"10.1109/ISCSLP.2012.6423467","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423467","url":null,"abstract":"This paper proposes a tone labeling technique for tonal language speech synthesis. Non-uniform segmentation using Viterbi alignment is introduced to determine the boundaries to get F0 symbols, which are used as tonal label to eliminate the mismatch between tone patterns and F0 contours of training data. During context clustering, the tendency of adjacent F0 state distributions are captured by the state-based phonetic trees. Means of tone model states are directly quantized to get full tonal label in the synthesis stage. Both objective and subjective experiment results show that the proposed technique can improve the perceptual prosody of synthetic speech of non-professional speakers.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126275026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context dependant phone mapping for cross-lingual acoustic modeling
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423496
Van Hai Do, Xiong Xiao, Chng Eng Siong, Haizhou Li
This paper presents a novel method for acoustic modeling with limited training data. The idea is to leverage a well-trained acoustic model of a source language. In this paper, a conventional HMM/GMM triphone acoustic model of the source language is used to derive likelihood scores for each feature vector of the target language. These scores are then mapped to triphones of the target language using neural networks. We conduct a case study where Malay is the source language and English (the Aurora-4 task) is the target language. Experimental results on the Aurora-4 clean test set show that using only 7, 16, and 55 minutes of English training data, we achieve word error rates of 21.58%, 17.97%, and 12.93%, respectively. These results significantly outperform conventional HMM/GMM and hybrid systems.
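The mapping stage admits a compact sketch: per-frame score vectors from the source-language model are fed to a small neural network that predicts target-language phone classes. The dimensions, network size, and toy data below are assumptions; the paper's exact topology is not reproduced here.

```python
# Source-language scores -> target-language phone classes via a small MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

n_src, n_tgt, n_frames = 120, 40, 3000       # source scores, target phones
rng = np.random.default_rng(0)
src_scores = rng.normal(size=(n_frames, n_src))   # stand-in for likelihoods
tgt_labels = rng.integers(0, n_tgt, size=n_frames)

mapper = MLPClassifier(hidden_layer_sizes=(256,), max_iter=20)
mapper.fit(src_scores, tgt_labels)           # learn source -> target mapping

posteriors = mapper.predict_proba(src_scores[:5])
print(posteriors.shape)                      # (5, n_tgt), fed to the decoder
```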
Text-Dependent Speaker Recognition with long-term features based on functional data analysis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423461
Chenhao Zhang, T. Zheng, Ruxin Chen
Text-Dependent Speaker Recognition (TDSR) is widely used nowadays. Short-term features such as Mel-Frequency Cepstral Coefficients (MFCCs) have been the dominant features in traditional Dynamic Time Warping (DTW) based TDSR systems. Short-term features capture the local portion of significant temporal dynamics well, but represent overall sentence-level statistical characteristics poorly. Functional Data Analysis (FDA) has been shown to offer significant advantages in exploring the statistical information of data, so in this paper a long-term feature extraction method based on MFCC and FDA theory is proposed. The extraction procedure consists of the following steps: first, FDA theory is applied after MFCC feature extraction; second, to compress redundant information, a new feature based on Functional Principal Component Analysis (FPCA) is generated; third, the distance between training features and test features is calculated for use in the recognition procedure. Experimental results show that the new features extracted with the proposed method, combined with the cosine similarity measure, outperform the existing MFCC-plus-DTW method.
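A minimal sketch of the pipeline follows: each utterance's MFCC trajectory is treated as a function of time, projected onto a few principal components, and utterances are compared by cosine similarity. Ordinary PCA over time-resampled trajectories stands in for full functional PCA here; that simplification is ours, not the authors'.

```python
# Long-term utterance features via PCA over resampled MFCC trajectories.
import numpy as np
from sklearn.decomposition import PCA

def resample(traj, n=50):
    """Resample a (frames, dims) trajectory to n frames per dimension."""
    t_old = np.linspace(0, 1, len(traj))
    t_new = np.linspace(0, 1, n)
    return np.column_stack([np.interp(t_new, t_old, traj[:, d])
                            for d in range(traj.shape[1])]).ravel()

rng = np.random.default_rng(0)
utts = [rng.normal(size=(rng.integers(40, 80), 13)) for _ in range(20)]
X = np.stack([resample(u) for u in utts])

pca = PCA(n_components=8).fit(X)             # stand-in for FPCA
feats = pca.transform(X)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(feats[0], feats[1]))            # enrolment vs. test score
```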
An improved steady segment based decoding algorithm by using response probability for LVCSR
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423525
Zhanlei Yang, Wenju Liu, Hao Chao
This paper proposes a novel decoding algorithm that integrates both steady speech segments and observations' location information into the conventional path extension framework. First, speech segments with stable spectra are extracted. Second, a preliminary improved algorithm is obtained by modifying the traditional inter-HMM extension framework using the detected steady segments. Then, at the probability calculation stage, the response probability (RP), which represents the location of observations within the acoustic feature space, is further incorporated into decoding. Thus, RP directs the decoder to enhance or weaken path candidates that survive the front-end steady-segment-based decoding. Experiments on Mandarin speech recognition show that the character error rate of the proposed algorithm achieves a 4.6% relative reduction compared with a system using only steady segments, and the runtime factor achieves a 10.0% relative reduction compared with a system using only RP.
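The first step, finding spectrally stable segments, can be sketched briefly: frames whose spectra change little from their neighbours are marked steady, and runs of such frames form the segments that constrain path extension. The distance measure, threshold, and minimum length below are assumptions.

```python
# Steady-segment detection via frame-to-frame spectral change.
import numpy as np

def steady_segments(spectra, thresh=0.15, min_len=5):
    """spectra: (frames, bins) array; returns (start, end) index pairs."""
    d = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    d /= d.max() + 1e-9                       # normalised spectral change
    steady = d < thresh
    segs, start = [], None
    for i, s in enumerate(steady):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_len:
                segs.append((start, i))
            start = None
    if start is not None and len(steady) - start >= min_len:
        segs.append((start, len(steady)))
    return segs

rng = np.random.default_rng(0)
spec = np.vstack([np.ones((30, 20)), rng.random((30, 20))])
print(steady_segments(spec))                  # e.g. [(0, 29)]
```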
{"title":"An improved steady segment based decoding algorithm by using response probability for LVCSR","authors":"Zhanlei Yang, Wenju Liu, Hao Chao","doi":"10.1109/ISCSLP.2012.6423525","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423525","url":null,"abstract":"This paper proposes a novel decoding algorithm by integrating both steady speech segments and observations' location information into conventional path extension framework. First, speech segments which possess stable spectrum are extracted. Second, a preliminarily improved algorithm is given by modifying traditional inter-HMM extension framework using the detected steady segments. Then, at probability calculation stage, response probability (RP), which represents location information of observations within acoustic feature space, is further incorporated into decoding. Thus, RP directs the decoder to enhance/weaken path candidates that get through the front end steady-segment-based decoding. Experiments conducted on Mandarin speech recognition show that character error rate of proposed algorithm achieves a 4.6% relative reduction when compared with a system in which only steady segment is used, and run time factor achieves a 10.0% relative reduction when compared with a system in which only RP is used.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124044020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A simple and effective pitch re-estimation method for rich prosody and speaking styles in HMM-based speech synthesis
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423473
Chengyuan Lin, Chien-Hung Huang, C. Kuo
This paper proposes a novel controllable pitch re-estimation method that can produce better pitch contours or provide diverse speaking styles for text-to-speech (TTS) systems. The method is composed of a pitch re-estimation model and a set of control parameters. The pitch re-estimation model is employed to reduce the over-smoothing effects usually introduced by TTS training. The control parameters are designed to generate not only rich intonation but also speaking styles, e.g. a foreign accent or an excited tone. To verify the feasibility of the proposed method, we conducted both objective measurements and subjective tests. Although the re-estimated pitch yields only slightly lower prediction error on the objective measure, it produces clearly better intonation in the listening test. Moreover, expressive speech can be generated successfully under the framework of controllable pitch re-estimation.
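One simple form such controllable re-estimation can take is re-scaling the generated log-F0 contour around its mean to restore dynamic range, with parameters controlling range and pitch level. The sketch below uses that form as an illustration; the specific control set is our assumption, not the paper's parameterisation.

```python
# Controllable re-scaling of an over-smoothed F0 contour.
import numpy as np

def reestimate_f0(f0, range_scale=1.4, level_shift_st=0.0):
    """range_scale widens the contour; level_shift_st shifts it in semitones."""
    logf0 = np.log(f0)
    mu = logf0.mean()
    stretched = mu + range_scale * (logf0 - mu)     # restore dynamic range
    return np.exp(stretched + level_shift_st * np.log(2) / 12)

smooth = 150 + 10 * np.sin(np.linspace(0, 6, 100))  # over-smoothed contour
lively = reestimate_f0(smooth, range_scale=1.6, level_shift_st=2.0)
print(round(np.ptp(smooth), 1), "->", round(np.ptp(lively), 1))  # wider range
```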
{"title":"A simple and effective pitch re-estimation method for rich prosody and speaking styles in HMM-based speech synthesis","authors":"Chengyuan Lin, Chien-Hung Huang, C. Kuo","doi":"10.1109/ISCSLP.2012.6423473","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423473","url":null,"abstract":"This paper proposes a novel way of controllable pitch re-estimation that can produce better pitch contour or provide diverse speaking styles for text-to-speech (TTS) systems. The method is composed of a pitch re-estimation model and a set of control parameters. The pitch re-estimation model is employed to reduce over-smoothing effects which is usually introduced by TTS training. The control parameters are designed to generate not only rich intonations but also speaking styles, e.g. a foreign accent or an excited tone. To verify the feasibility of the proposed method, we conducted experiments for both objective measures and subjective tests. Although the re-estimated pitch results in only slightly less prediction error for objective measure, it produces clearly better intonation for listening test. Moreover, the expressive speech can be generated successfully under the framework of controllable pitch re-estimation.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127405149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
mENUNCIATE: Development of a computer-aided pronunciation training system on a cross-platform framework for mobile, speech-enabled application development
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423507
Pengfei Liu, K. Yuen, Wai-Kim Leung, H. Meng
This paper presents our ongoing research in the field of speech-enabled multimodal, mobile application development. We have developed a multimodal framework that enables cross-platform development using open standards-based HTML, CSS and JavaScript. The framework achieves high extensibility through a plugin-based architecture and provides scalable REST-based speech services in the cloud to support large numbers of requests from mobile devices. This paper describes the architecture and implementation of the framework, and the development of a mobile computer-aided pronunciation training application for Chinese learners of English, named mENUNCIATE, based on this framework. We also report a preliminary performance evaluation of mENUNCIATE.
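For flavour, a client call to a REST-based speech service of the kind such a framework exposes might look as follows. The endpoint URL, field names, and response schema are hypothetical placeholders, not the actual mENUNCIATE API.

```python
# Hypothetical REST client for a cloud pronunciation-scoring service.
import requests

def score_pronunciation(wav_path, text):
    with open(wav_path, "rb") as f:
        resp = requests.post(
            "https://speech.example.org/api/v1/score",   # placeholder URL
            files={"audio": f},
            data={"transcript": text, "lang": "en"},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()        # e.g. per-phone scores for feedback display

# score_pronunciation("hello.wav", "hello")  # runs against a live server
```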
{"title":"mENUNCIATE: Development of a computer-aided pronunciation training system on a cross-platform framework for mobile, speech-enabled application development","authors":"Pengfei Liu, K. Yuen, Wai-Kim Leung, H. Meng","doi":"10.1109/ISCSLP.2012.6423507","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423507","url":null,"abstract":"This paper presents our ongoing research in the field of speech-enabled multimodal, mobile application development. We have developed a multimodal framework that enables cross-platform development using open standards-based HTML, CSS and JavaScript. This framework brings high extendibility through plugin-based architecture and provides scalable REST-based speech services in the cloud to support large amounts of requests from mobile devices. This paper describes the architecture and implementation of the framework, and the development of a mobile computer-aided pronunciation training application for Chinese learners of English, named mENUNCIATE, based on this framework. We also report a preliminary performance evaluation on mENUNCIATE.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133723654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced lengthening cancellation using bidirectional pitch similarity alignment for spontaneous speech
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423517
Po-Yi Shih, Bo-Wei Chen, Jhing-Fa Wang, Jhing-Wei Wu
In this work, an enhanced lengthening cancellation method is proposed to detect and cancel the lengthened part of vowels. The proposed method consists of an autocorrelation function, cosine similarity-based lengthening detection, and bidirectional pitch contour alignment. The autocorrelation function is used to obtain the reference pitch contour, and the cosine similarity-based method measures the similarity between the reference and the next adjacent pitch contours. Because periodic segments vary in length, fixed-size frames may cause accumulative errors; therefore, bidirectional pitch contour alignment is adopted in this study. Experiments indicate that the proposed method achieves accuracy rates of 91.4% and 88.7% on a 60-keyword and a 50-sentence database, respectively. Moreover, the proposed approach runs about three times faster than the baseline. These results prove the effectiveness of the proposed method.
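The detection chain reduces to two small pieces, sketched below: an autocorrelation pitch-period estimate gives a reference segment, and cosine similarity between adjacent pitch-period segments flags sustained (lengthened) vowel regions. Frame sizes and the similarity threshold are assumptions.

```python
# Autocorrelation pitch estimate + adjacent-period cosine similarity.
import numpy as np

def pitch_period(frame):
    """Autocorrelation-based period estimate, in samples."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = 20                                   # skip the zero-lag peak region
    return lo + int(np.argmax(ac[lo:]))

def cosine(a, b):
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 150 * t)               # sustained 150 Hz "vowel"

p = pitch_period(x[:400])                     # reference period
sims = [cosine(x[i:i + p], x[i + p:i + 2 * p])
        for i in range(0, len(x) - 2 * p, p)]
lengthened = np.mean(np.array(sims) > 0.95)   # fraction of similar periods
print(p, round(float(lengthened), 2))         # ~53 samples, ~1.0
```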
{"title":"Enhanced lengthening cancellation using bidirectional pitch similarity alignment for spontaneous speech","authors":"Po-Yi Shih, Bo-Wei Chen, Jhing-Fa Wang, Jhing-Wei Wu","doi":"10.1109/ISCSLP.2012.6423517","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423517","url":null,"abstract":"In this work, an enhanced lengthening cancellation method is proposed to detect and cancel the lengthening part of vowels. The proposed method consists of autocorrelation function, cosine similarity-based lengthening detection and bidirectional pitch contour alignment. Autocorrelation function is used to obtain the reference pitch contour. Cosine similarity-based method is applied to measure the similarity between the reference and the next adjacent pitch contours. Due to the variant lengths of periodic segments, fixed size frames may cause accumulative errors. Therefore, bidirectional pitch contour alignment is adopted in this study. Experiments indicate that the proposed method can achieve an accuracy rate of 91.4% and 88.7% on a 60-keyword and 50-scentence database, respectively. Moreover, the proposed approach performs about three times speed than the baseline. Such results prove the effectiveness of the proposed method.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"445 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133780486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}