Chao Zhang, Yi Liu, Yunqing Xia, Xuan Wang, Chin-Hui Lee
{"title":"Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition","authors":"Chao Zhang, Yi Liu, Yunqing Xia, Xuan Wang, Chin-Hui Lee","doi":"10.1109/TASL.2013.2265087","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a discriminative dynamic Gaussian mixture selection (DGMS) strategy to generate reliable accent-specific units (ASUs) for multi-accent speech recognition. Time-aligned phone recognition is used to generate the ASUs that model accent variations explicitly and accurately. DGMS reconstructs and adjusts a pre-trained set of hidden Markov model (HMM) state densities to build dynamic observation densities for each input speech frame. A discriminative minimum classification error criterion is adopted to optimize the sizes of the HMM state observation densities with a genetic algorithm (GA). To the author's knowledge, the discriminative optimization for DGMS accomplishes discriminative training of discrete variables that is first proposed. We found the proposed framework is able to cover more multi-accent changes, thus reduce some performance loss in pruned beam search, without increasing the model size of the original acoustic model set. Evaluation on three typical Chinese accents, Chuan, Yue and Wu, shows that our approach outperforms traditional acoustic model reconstruction techniques with a syllable error rate reduction of 8.0%, 5.5% and 5.0%, respectively, while maintaining a good performance on standard Putonghua speech.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2265087","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2013.2265087","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11
Abstract
In this paper, we propose a discriminative dynamic Gaussian mixture selection (DGMS) strategy to generate reliable accent-specific units (ASUs) for multi-accent speech recognition. Time-aligned phone recognition is used to generate the ASUs that model accent variations explicitly and accurately. DGMS reconstructs and adjusts a pre-trained set of hidden Markov model (HMM) state densities to build dynamic observation densities for each input speech frame. A discriminative minimum classification error criterion is adopted to optimize the sizes of the HMM state observation densities with a genetic algorithm (GA). To the author's knowledge, the discriminative optimization for DGMS accomplishes discriminative training of discrete variables that is first proposed. We found the proposed framework is able to cover more multi-accent changes, thus reduce some performance loss in pruned beam search, without increasing the model size of the original acoustic model set. Evaluation on three typical Chinese accents, Chuan, Yue and Wu, shows that our approach outperforms traditional acoustic model reconstruction techniques with a syllable error rate reduction of 8.0%, 5.5% and 5.0%, respectively, while maintaining a good performance on standard Putonghua speech.
期刊介绍:
The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.