Pitch accent detection and prediction with DCT features and CRF model
Wenping Hu, Yao Qian, F. Soong
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423504
Automatic detection/prediction of pitch accent, which determines whether a word contains a prominent syllable and what its pitch accent pattern is, is crucial for expressive Text-To-Speech (TTS) synthesis. Training a model to detect and predict pitch accent usually requires a large amount of training data manually annotated by phonetically trained language experts, which is both time consuming and costly. In this paper, we propose a semi-automatic algorithm for pitch accent modeling, where the existence of accentuation in the training data is labeled at the word level by native speakers (i.e., not phonetically trained language experts) and the type of a pitch accent is detected automatically from its vector-quantized DCT coefficient patterns. A cascaded, two-stage approach, which separates predicting pitch accent existence from determining the corresponding pitch accent type, is proposed to process unrestricted text input with Conditional Random Field (CRF) models. The evaluation results show that the new approach outperforms the conventional, single-stage approach.
Hierarchical clustering and robust identification for block-based autoregressive speech parameter estimation
Ruofei Chen, C. Chan
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423482
Given accurate system parameters such as the state transition matrix F and the corruption mapping matrix H, clean speech autoregressive (AR) parameters can be effectively estimated from a series of noisy observations with Kalman filtering. In this paper, we address several fundamental issues in improving linear dynamical system (LDS) based AR parameter estimation. A hierarchical time series clustering scheme is devised to properly group speech blocks with similar trajectories and corruption types. In addition, a correlated robust identification scheme using an a posteriori signal-to-noise ratio (SNR) mask is proposed to improve identification accuracy. The effectiveness of the proposed clustering and identification scheme is evaluated in terms of the spectral distortion between the Kalman estimates and the true clean speech parameters. Significant improvement is observed over the original matrix quantization (MQ) based approach. The proposed scheme is also successfully applied in a model-based speech enhancement application, and is expected to be effective for robust identification in various codebook-driven speech applications.
Automatic pitch accent detection using auto-context with acoustic features
Junhong Zhao, Weiqiang Zhang, Hua Yuan, Jia Liu, Shanhong Xia
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423523
In the field of prosody event detection, many local acoustic features have been proposed to represent the prosodic characteristics of a speech unit. Context information, which captures regularities underlying neighboring prosody events, has not yet been used effectively. The main difficulty in utilizing prosodic context is that long-distance sequential dependencies are hard to capture. To solve this problem, we introduce a new learning approach: auto-context. In this algorithm, a classifier is first trained on local acoustic features; the discriminative probabilities produced by the classifier are then selected as context information for the next iteration, and a new classifier is trained on the selected context information together with the local acoustic features. By repeatedly feeding the updated probabilities back as context information, the algorithm boosts its recognition ability over the iterations until convergence. The merit of this method is that it can choose context information flexibly, retaining reliable context information and discarding unreliable information. The experimental results show that the proposed method improves pitch accent detection accuracy by about 1% absolute.
Effective sentence selection based on phone/model coverage maximization for speaker adaptation in HMM-based speech synthesis
C. Lin, Po Kai Huang, Chengyuan Lin, C. Kuo
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423469
Reducing the recording effort required in practical speaker-adaptive text-to-speech applications would be very useful. In this paper, we present two sentence selection approaches based on a greedy algorithm; one is based on phone coverage and the other is based on model coverage. The former considers the phonetic information in speaker adaptation data, while the latter focuses on occurrences of Mel-cepstral and logF0 models in decision trees of the average voice model. To verify the efficacy of the proposed methods, we compare their performance with that of a random selection method in objective and subjective evaluations. The objective and subjective evaluation results demonstrate that both methods outperform the random selection method.
Diachronic contrastive analysis on read speech in broadcast news: Evidence from pitch and duration
Yu Zou, Yan Wang, W. He
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423498
Which diachronic phonetic changes have taken place in Mandarin Chinese over the past 100 years? This paper analyzes and compares the pitch and duration of read speech in broadcast news from a diachronic perspective. The results show that pitch peaks are highest and the pitch range widest in the 1970s, with raised pitch valleys occurring especially frequently; the 1950s-60s rank second. Across the three periods of the 1980s, 1990s and 2000s, pitch peaks gradually drift down and the pitch range narrows. The average speaking rate of the 1970s is the slowest and syllable durations are the longest, with the 1950s-60s again second; across the other three periods, the speaking rate increases and syllable durations shorten. Furthermore, these prosodic features are not determined by the text in the different historical periods.
Phrase-based data selection for language model adaptation in spoken language translation
Shixiang Lu, Wei Wei, Xiaoyin Fu, Lichun Fan, Bo Xu
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423483
In this paper, we propose an unsupervised phrase-based data selection model that addresses the problem of selecting non-domain-specific language model (LM) training data to build an adapted LM. In a spoken language translation (SLT) system, we aim to find the LM training sentences that are most similar to the translation task. Compared with traditional bag-of-words models, the phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. Large-scale experimental results demonstrate that our approach significantly outperforms state-of-the-art approaches on both LM perplexity and translation performance.
An analysis of vector Taylor series model compensation for non-stationary noise in speech recognition
Duc Hoang Ha Nguyen, Xiong Xiao, Chng Eng Siong, Haizhou Li
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423503
In this paper, we investigate a feature conditioning method for VTS-based model compensation. VTS is a technique that predicts a noisy acoustic model from a clean acoustic model and a noise model. Most previous studies use a single-Gaussian noise model, which cannot model noise statistics well, especially in non-stationary noisy environments. In this paper, we propose a combination of feature processing and VTS model compensation to handle non-stationary noise more effectively. In the feature processing stage, the non-stationary characteristics of the noise are reduced, so the processed features are more suitable for VTS model compensation with a single-Gaussian noise model. Experimental analysis on the AURORA2 task shows that the proposed method has the potential to improve the performance of the VTS method in non-stationary environments when good noise estimates are available.
Effects of carriers on Mandarin tone categorical perception
Dazuo Wang, Xiuxiu Wang, Gang Peng
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423505
This study investigated the effects of three different carriers on Mandarin tone perception. Three tone continua were constructed: modified speech, synthesized speech, and nonspeech. Identification tests were conducted for the two speech continua, while discrimination tests were conducted for all three continua. Results showed that the category boundary position differed significantly between the modified speech and synthesized speech continua. The boundary position of the modified speech tone continuum was shifted toward the rising end relative to that of the synthesized speech tone continuum, suggesting that greater complexity reduces overall pitch sensitivity. In the discrimination test, subjects generally exhibited the same pattern for the three continua, but with slightly lower discrimination accuracy for the nonspeech continuum, suggesting that the effect of long-term tone-language experience with Mandarin carries over to the nonspeech domain.
Improve mispronunciation detection with Tandem feature
Hua Yuan, Junhong Zhao, Jia Liu
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423538
This paper presents a method to improve mispronunciation detection performance for low-resource acoustic models. One hour of speech data is randomly selected from CU-CHLOE to simulate the low-resource non-native English situation. Tandem features derived from articulatory-based Multi-Layer Perceptrons (MLPs) are employed to replace traditional spectral features (e.g., PLP). Further, motivated by the similar pronunciation characteristics of Chinese-accented English and Mandarin, Mandarin speech data is used to assist in training the multilingual articulatory MLPs. The Tandem feature is also combined with PLP to improve performance. With the proposed method, phone recognition correctness (CORR) improves by 3.84% and diagnosis accuracy (DA) by 2.25%.
Voice conversion using Bayesian mixture of Probabilistic Linear Regressions and dynamic kernel features
Na Li, Y. Qiao
Pub Date: 2012-12-01 | DOI: 10.1109/ISCSLP.2012.6423521
Voice conversion can be formulated as finding a mapping function which transforms the features of a source speaker into those of a target speaker. Gaussian mixture model (GMM)-based conversion techniques [1, 2] have been widely used in voice conversion due to their effectiveness and efficiency. In a recent work [3], we generalized GMM-based mapping to a Mixture of Probabilistic Linear Regressions (MPLR). But both GMM-based mapping and MPLR are subject to overfitting, especially when training utterances are sparse, and both ignore the inherent time dependency among speech features. This paper addresses these problems by introducing dynamic kernel features and conducting a Bayesian analysis for MPLR. The dynamic kernel features are calculated as kernel transformations of the current, previous and next frames, which can model both the nonlinearities and the dynamics in the features. We further develop Maximum a Posteriori (MAP) inference to alleviate overfitting by placing a prior on the parameters of the kernel transformation. Our experimental results show that the proposed methods achieve better performance than the MPLR-based model.