The modeling of tongue tip in Standard Chinese using MRI
Pub Date: 2017-02-21 | DOI: 10.16511/J.CNKI.QHDXXB.2017.22.008
Wang Gaowu, Dang Jianwu, Kong Jiangping
Summary form only given. In this paper, the tongue tip was modeled based on articulatory data from MRI images of Standard Chinese. First, an MRI articulatory database of Standard Chinese, including 9 vowels and 75 consonant variants, was established. Second, Principal Component Analysis (PCA) was performed on the tongue shapes to find articulatory factors, and the results showed that the model is more precise and concise when the tongue is divided into the tongue tip and the tongue body and the two parts are modeled separately. Finally, based on this result, the tongue tip was modeled by two articulatory parameters, Tongue Tip Protrude and Tongue Tip Raise, which represent the protruding/advancing and raising/retroflexing movements of the tongue tip, respectively.
{"title":"The modeling of tongue tip in Standard Chinese using MRI","authors":"Wang Gaowu, Dang Jianwu, Kong Jiangping","doi":"10.16511/J.CNKI.QHDXXB.2017.22.008","DOIUrl":"https://doi.org/10.16511/J.CNKI.QHDXXB.2017.22.008","url":null,"abstract":"Summary form only given. In this paper, the tongue tip was modeled based on the articulatory data from MRI images in Standard Chinese. First, the MRI articulatory database of Standard Chinese, including 9 vowels and 75 consonant variants, were established. Second, Principle Component Analysis (PCA) was performed on the tongue shape to find articulatory factors, and the result showed that it would be more precise and concise when the tongue was divided as the tongue tip and tongue body and modeled separately. Finally, according to this result, the tongue tip was modeled by two articulatory parameters: Tongue Tip Protrude and Tongue Tip Raise, which represents the protruding/advancing and raising/retroflexing movements of the tongue tip.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"259 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122678418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multi-channel/multi-speaker articulatory database in Mandarin for speech visualization
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936629
Dan Zhang, Xianqian Liu, N. Yan, Lan Wang, Yun Zhu, Hui Chen
Articulatory databases have been used in speech production research and automatic speech recognition for many years. The goal of this research was to build an articulatory database dedicated to Mandarin Chinese production and to investigate its efficacy in speech animation. A Carstens EMA AG501 device was used to capture the acoustic and articulatory data, and a Microsoft Kinect camera was used to capture supplementary face-tracking data. Finally, we tried several methods for extracting acoustic parameters and built a 3D talking-head model to verify the efficacy of the database.
{"title":"A multi-channel/multi-speaker articulatory database in Mandarin for speech visualization","authors":"Dan Zhang, Xianqian Liu, N. Yan, Lan Wang, Yun Zhu, Hui Chen","doi":"10.1109/ISCSLP.2014.6936629","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936629","url":null,"abstract":"The application of articulatory database in speech production and automatic speech recognition has been practiced for many years. The goal of the research was to build an articulatory database specifying in Chinese Mandarin production and to investigate its efficacy in speech animation. Carstens EMA AG501 device were respectively used to capture acoustic data and articulatory data. Also, a Microsoft Kinect camera was applied to capture face-tracking data as a supplement. Finally, we tried several methods to extract acoustic parameters and built up a 3D talking head model to verify the efficacy of the database.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123548026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Labeling unsegmented sequence data with DNN-HMM and its application for speech recognition
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936622
Xiangang Li, Xihong Wu
Recently, the deep neural network (DNN) combined with the hidden Markov model (HMM) has turned out to be a superior sequence-learning framework, based on which significant improvements have been achieved in many application tasks, such as automatic speech recognition (ASR). However, training a DNN-HMM requires pre-segmented training data, which in ASR tasks is usually generated with a Gaussian Mixture Model (GMM). This raises a question posed by many researchers: can the DNN-HMM be trained without GMM seeding, and what does it imply if the answer is yes? In this work, we arrive at the `yes' answer by presenting a forward-backward learning algorithm for the DNN-HMM framework. In addition, a training procedure is proposed in which the training of a context-independent (CI) DNN-HMM is treated as pre-training for the context-dependent (CD) DNN-HMM. To evaluate the contribution of this work, experiments on an ASR task with the benchmark TIMIT corpus are performed, and the results demonstrate the effectiveness of the approach.
{"title":"Labeling unsegmented sequence data with DNN-HMM and its application for speech recognition","authors":"Xiangang Li, Xihong Wu","doi":"10.1109/ISCSLP.2014.6936622","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936622","url":null,"abstract":"Recently, deep neural network (DNN) with hidden Markov model (HMM) has turned out to be a superior sequence learning framework, based on which significant improvements were achieved in many application tasks, such as automatic speech recognition (ASR). However, the training of DNN-HMM requires the pre-segmented training data, which can be generated using Gaussian Mixture Model (GMM) in ASR tasks. Thus, questions are raised by many researchers: can we train the DNN-HMM without GMM seeding, and what does it suggest if the answer is yes? In this research, we come up with the `yes' answer by presenting forward-backward learning algorithm for DNN-HMM framework. Besides, a training procedure is proposed, in which, the training for context independent (CI) DNN-HMM is treated as the pre-training for context dependent (CD) DNN-HMM. To evaluate the contribution of this work, experiments on ASR task with the benchmark corpus TIMIT are performed, and the results demonstrate the effectiveness of this research.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125556354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving training time of deep neural network with asynchronous averaged stochastic gradient descent
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936596
Zhao You, Bo Xu
Deep neural network acoustic models have shown large performance improvements over Gaussian mixture models (GMMs) in recent studies. Stochastic gradient descent (SGD) is typically the most popular method for training deep neural networks. However, training a DNN with minibatch-based SGD is very slow, because it requires frequent serial updates and many passes over the whole training set before reaching the asymptotic region, which makes it difficult to scale to large datasets. Training time can generally be reduced in two ways: reducing the number of training epochs and using distributed training algorithms. Several distributed training algorithms, such as L-BFGS, Hessian-free optimization, and asynchronous SGD, have been shown to reduce training time significantly. To reduce training time further, we explore a training algorithm with fast convergence and combine it with distributed training. Averaged stochastic gradient descent (ASGD) has proven simple and effective for one-pass online learning. This paper investigates the asynchronous ASGD algorithm for deep neural network training. We tested asynchronous ASGD on a Mandarin Chinese recorded-speech recognition task using deep neural networks. Experimental results show that the performance of one-pass asynchronous ASGD is very close to that of multi-pass asynchronous SGD, while reducing the training time by a factor of 6.3.
{"title":"Improving training time of deep neural networkwith asynchronous averaged stochastic gradient descent","authors":"Zhao You, Bo Xu","doi":"10.1109/ISCSLP.2014.6936596","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936596","url":null,"abstract":"Deep neural network acoustic models have shown large improvement in performance over Gaussian mixture models (GMMs) in recent studies. Typically, stochastic gradient descent (SGD) is the most popular method for training deep neural networks. However, training DNN with minibatch based SGD is very slow. Because it requires frequent serial training and scanning the whole training set many passes before reaching the asymptotic region, making it difficult to scale to large dataset. Commonly, we can reduce training time from two aspects, reducing the epochs of training and exploring the distributed training algorithm. There are some distributed training algorithms, such as LBFGS, Hessian-free optimization and asynchronous SGD, have proven significantly reducing the training time. In order to further reduce the training time, we attempted to explore training algorithm with fast convergence and combined it with distributed training algorithm. Averaged stochastic gradient descent (ASGD) is proved simple and effective for one pass on-line learning. This paper investigates the asynchronous ASGD algorithm for deep neural network training. We tested asynchronous ASGD on the Mandarin Chinese recorded speech recognition task using deep neural networks. Experimental results show that the performance of one pass asynchronous ASGD is very close to that of multiple passes asynchronous SGD. Meanwhile, we can reduce the training time by a factor of 6.3.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128320719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint-character-POC N-gram language modeling for Chinese speech recognition
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936588
Bin Wang, Zhijian Ou, Jian Li, A. Kawamura
The state-of-the-art language models (LMs) for Chinese speech recognition are word n-gram models. However, in Chinese, characters carry morphological meaning and word boundaries are not consistently defined. There has been recent interest in building character n-gram LMs and combining them with word n-gram LMs. In this paper, in order to exploit both character-level and word-level constraints, we propose the joint n-gram LM, an n-gram model over joint states, where each joint state is a pair of a character and its position-of-character (POC) tag. We point out the pitfall in naively solving the smoothing and scoring problems for joint n-gram models and provide corrected solutions. For experimental comparison, different LMs (word 4-grams, character 6-grams, and joint 6-grams) are tested for speech recognition using a training corpus of 1.9 billion characters. The joint n-gram LM achieves performance improvements, especially in recognizing utterances containing OOV words.
{"title":"Joint-character-POC N-gram language modeling for Chinese speech recognition","authors":"Bin Wang, Zhijian Ou, Jian Li, A. Kawamura","doi":"10.1109/ISCSLP.2014.6936588","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936588","url":null,"abstract":"The state-of-the-art language models (LMs) for Chinese speech recognition are word n-gram models. However, in Chinese, characters are morphological in meaning and words are not consistently defined. There are recent interests in building the character n-gram LM and its combination with the word n-gram LM. In this paper, in order to exploit both character-level and word-level constraints, we propose the joint n-gram LM, which is an n-gram model based on joint-state that is a pair of character and its position-of-character (POC) tag. We point out the pitfall in naive solving of the smoothing and scoring problems for joint n-gram models, and provide corrected solutions. For experimental comparison, different LMs (including word 4-grams, character 6-grams and joint 6-grams) are tested for speech recognition, using training corpus of 1.9 billion characters. The joint n-gram LM achieves performance improvements, especially in recognizing the utterances containing OOV words.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129741925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Research on truncated speech in speaker verification
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936671
Fanhu Bie, Dong Wang, T. Zheng
Summary form only given. Speech truncation is a common problem in practical speaker recognition systems. When speech is truncated in amplitude, its spectrum is altered, which degrades system performance. This paper reports observations and conclusions on the impact of truncated segments, studies why they affect recognition performance, and presents methods for detecting truncated segments and mitigating the performance loss. Simulations on NIST SRE08 show that performance drops sharply only when the amplitude truncation ratio is high (above 80% of the maximum amplitude); the traditional GMM-UBM system and the i-vector system behave similarly when the truncation ratio is low, while the i-vector system is more robust when it is high. The paper proposes a truncated-segment detection method based on subspace discriminant information, which is then used to discard the truncated segments. Experiments show that this method detects truncated segments well. However, the results also show that the truncated segments still carry speaker-discriminant information: when the amplitude truncation ratio is low, it is better to keep the data to sustain performance; otherwise, the speaker should make another recording to maintain system performance.
{"title":"Research on truncated speech in speaker verification","authors":"Fanhu Bie, Dong Wang, T. Zheng","doi":"10.1109/ISCSLP.2014.6936671","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936671","url":null,"abstract":"Summary form only given. The speech truncating phenomenon is a general problem is practical speaker recognition system. After the speech was truncated by amplitude, the spectral was changed during the process, resulting in the decreasing in the system`s performance. The paper describes the observation and the conclusion on the impact of the truncated segments, studies the reason of the impact on the recognition performance, gives out the ways of the truncated segments detection and reducing the decreasing of the performance. The simulation on NIST SRE08 shows that, just when the amplitude truncating ratio remains high (more than the 80% of the maximum amplitude), the performance drops sharply; the performance of traditional GMM-UBM system and I-vector system behavior familiar when the amplitude truncating is low, while I-vector gives a better robustness when is high. The paper gives out a proposal on truncating segments detection based on subspace discriminant information, which is then used to discard the truncating segments. The experiments show that this proposal could well detect the truncated segments. However, the results show that there are still speaker discriminant information in the truncated segments, when the amplitude truncated ratio remains low, it's better to remain the data to sustain the performance, otherwise, the speaker should take another recording to keep the system performance.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114070795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phonotactic language recognition based on DNN-HMM acoustic model
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936704
Weiwei Liu, Meng Cai, Hua Yuan, Xiao-Bei Shi, Weiqiang Zhang, Jia Liu
The recently introduced deep neural network (DNN) has achieved unprecedented gains in many challenging automatic speech recognition (ASR) tasks. In this paper, deep neural network hidden Markov model (DNN-HMM) acoustic models are introduced to phonotactic language recognition and outperform artificial neural network hidden Markov model (ANN-HMM) and Gaussian mixture model hidden Markov model (GMM-HMM) acoustic models. Experimental results confirm that the phonotactic language recognition system using the DNN-HMM acoustic model yields relative equal error rate reductions of 28.42%, 14.06%, and 18.70% over the ANN-HMM approach and 12.55%, 7.20%, and 2.47% over the GMM-HMM approach for the 30 s, 10 s, and 3 s conditions, respectively, on the National Institute of Standards and Technology language recognition evaluation (NIST LRE) 2009 tasks.
{"title":"Phonotactic language recognition based on DNN-HMM acoustic model","authors":"Weiwei Liu, Meng Cai, Hua Yuan, Xiao-Bei Shi, Weiqiang Zhang, Jia Liu","doi":"10.1109/ISCSLP.2014.6936704","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936704","url":null,"abstract":"A recently introduced deep neural network (DNN) has achieved some unprecedented gains in many challenging automatic speech recognition (ASR) tasks. In this paper deep neural network hidden Markov model (DNN-HMM) acoustic models is introduced to phonotactic language recognition and outperforms artificial neural network hidden Markov model (ANN-HMM) and Gaussian mixture model hidden Markov model (GMM-HMM) acoustic model. Experimental results have confirmed that phonotactic language recognition system using DNN-HMM acoustic model yields relative equal error rate reduction of 28.42%, 14.06%, 18.70% and 12.55%, 7.20%, 2.47% for 30s, 10s, 3s comparing with the ANN-HMM and GMM-HMM approaches respectively on National Institute of Standards and Technology language recognition evaluation (NIST LRE) 2009 tasks.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"306 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121588695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An iterative framework for unsupervised learning in the PLDA based speaker verification
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936726
Wenbo Liu, Zhiding Yu, Ming Li
We present an iterative and unsupervised learning approach for the speaker verification task. In conventional speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) has been widely used as a supervised backend. However, PLDA requires fully labeled training data, which is often difficult to obtain in practice. To automatically retrieve the speaker labels of unlabeled training data, we propose to use Affinity Propagation (AP), a clustering method that takes pairwise data similarities as input, to generate the labels for PLDA modeling. We further propose an iterative refinement strategy that incrementally updates the similarity input of the AP clustering with the previous iteration's PLDA scoring outputs. Moreover, we evaluate the performance of different PLDA scoring methods for the multiple-enrollment task and show that generalized hypothesis testing achieves the best results. Experiments were conducted on the NIST SRE 2010 and the 2014 i-vector challenge databases. The results show that our proposed iterative and unsupervised PLDA model learning approach outperformed the cosine similarity baseline by a relative 35%.
{"title":"An iterative framework for unsupervised learning in the PLDA based speaker verification","authors":"Wenbo Liu, Zhiding Yu, Ming Li","doi":"10.1109/ISCSLP.2014.6936726","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936726","url":null,"abstract":"We present an iterative and unsupervised learning approach for the speaker verification task. In conventional speaker verification, Probabilistic Linear Discriminant Analysis (PLDA) has been widely used as a supervised backend. However, PLDA requires fully labeled training data, which is often difficult to obtain in reality. To automatically retrieve the speaker labels of unlabeled training data, we propose to use the Affinity Propagation (AP) - a clustering method that takes pairwise data similarity as input - to generate the labels for the PLDA modeling. We further propose an iterative refinement strategy that incrementally updates the similarity input of the AP clustering with the previous iteration's PLDA scoring outputs. Moreover, we evaluate the performance of different PLDA scoring methods for the multiple enrollment task and show that the generalized hypothesis testing achieves the best results. Experiments were conducted on the NIST SRE 2010 and the 2014 i-vector challenge database. The results show that our proposed iterative and unsupervised PLDA model learning approach outperformed the cosine similarity baseline by 35% relatively.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"1021 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121713388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Influences of vowels on perception of nasal codas in Mandarin for Japanese learners and Chinese
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936691
Zuyan Wang, Jin-Song Zhang
This paper studies the perceptual influence of vowel segments on judgments of alveolar/velar nasal codas by Chinese and Japanese subjects, through two experiments: (a) perception of natural syllables and (b) perception of synthesized syllables. The results show that: 1) The nasalized vowels play a dominant role in cueing Chinese subjects to judge which coda the nasal is, whereas they have little effect on Japanese subjects, especially for the discrimination between an and ang. 2) When the nasalized vowel portions are missing, the vowel nuclei exert similar influences on perception for both Chinese and Japanese subjects: the larger the acoustic differences between the vowel nuclei of an alveolar/velar pair, the easier it is for both Chinese and Japanese listeners to distinguish them correctly. From these results, we suggest that the acoustic differences between the vowel portions of alveolar/velar nasal pairs, and sensitivity to the nasalized vowels, should be emphasized when Japanese students learn Chinese as a second language.
{"title":"Influences of vowels on perception of nasal codas in Mandarin for Japanese learners and Chinese","authors":"Zuyan Wang, Jin-Song Zhang","doi":"10.1109/ISCSLP.2014.6936691","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936691","url":null,"abstract":"This paper aims at studying on perceptual influences from vowel segments on the judgments of alveolar/velar nasals by Chinese and Japanese subjects, through two experiments: a) perception of the natural syllables; b) perception of the synthesized syllables. The results show that: 1) The nasalized vowels play dominating roles in cueing Chinese subjects to judge which coda the nasal is, whereas they have few effects for Japanese subjects, especially for the discrimination between an and ang. 2) When the nasalized vowel portions are missing, the vowel nuclei lay similar influences on the perceptions for both Chinese and Japanese. The larger the acoustic differences between the vowel nuclei are for the pair of alveolar/velar ones, the easier it is for both Chinese and Japanese to correctly distinguish them. From these results, we suggest that the importance of the acoustic differences between vowel portions in the pair of alveolar/velar nasals, and the sensitivity to the nasalized vowels, should be highlighted in the learning of Chinese as a second language by Japanese students.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130888272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving segmental GMM based voice conversion method with target frame selection
Pub Date: 2014-10-27 | DOI: 10.1109/ISCSLP.2014.6936633
H. Gu, Sung-Fung Tsai
In this paper, the voice conversion method based on segmental Gaussian mixture models (GMMs) is further improved by adding a target frame selection (TFS) module. Segmental GMMs replace a single GMM with a large number of mixture components by several voice-content-specific GMMs, each consisting of far fewer mixture components. In addition, TFS is used to find, from the target-speaker frame pool corresponding to the segment class that the input frame belongs to, a frame whose spectral features are close to the mapped feature vector. Both ideas are intended to alleviate the problem that converted spectral envelopes are often over-smoothed. To evaluate the two ideas, three voice conversion systems are constructed and used in listening tests. The test results show that the system using both ideas together obtains much improved voice quality. In addition, the measured variance ratio (VR) values show that the system adopting both ideas also obtains the highest VR value.
{"title":"Improving segmental GMM based voice conversion method with target frame selection","authors":"H. Gu, Sung-Fung Tsai","doi":"10.1109/ISCSLP.2014.6936633","DOIUrl":"https://doi.org/10.1109/ISCSLP.2014.6936633","url":null,"abstract":"In this paper, the voice conversion method based on segmental Gaussian mixture models (GMMs) is further improved by adding the module of target frame selection (TFS). Segmental GMMs are meant to replace a single GMM of a large number of mixture components with several voice-content specific GMMs each consisting of much fewer mixture components. In addition, TFS is used to find a frame, of spectral features near to the mapped feature vector, from the target-speaker frame pool corresponding to the segment class as the input frame belongs to. Both ideas are intended to alleviate the problem that the converted spectral envelopes are often over smoothed. To evaluate the performance of the two ideas mentioned, three voice conversion systems are constructed, and used to conduct listening tests. The results of the tests show that the system using the two ideas together can obtain much improved voice quality. In addition, the measured variance ratio (VR) values show that the system with the two ideas adopted also obtains the highest VR value.","PeriodicalId":285277,"journal":{"name":"The 9th International Symposium on Chinese Spoken Language Processing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133132931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}