Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition

IEEE Transactions on Audio Speech and Language Processing, Pub Date: 2013-10-01, DOI: 10.1109/TASL.2013.2265087

Chao Zhang, Yi Liu, Yunqing Xia, Xuan Wang, Chin-Hui Lee

Pages: 2073-2084

Citations: 11

Abstract

In this paper, we propose a discriminative dynamic Gaussian mixture selection (DGMS) strategy to generate reliable accent-specific units (ASUs) for multi-accent speech recognition. Time-aligned phone recognition is used to generate ASUs that model accent variations explicitly and accurately. DGMS reconstructs and adjusts a pre-trained set of hidden Markov model (HMM) state densities to build dynamic observation densities for each input speech frame. A discriminative minimum classification error criterion is adopted to optimize the sizes of the HMM state observation densities with a genetic algorithm (GA). To the authors' knowledge, this discriminative optimization for DGMS is the first proposed discriminative training of discrete variables. We found that the proposed framework covers more multi-accent variation, and thus reduces some of the performance loss in pruned beam search, without increasing the size of the original acoustic model set. Evaluation on three typical Chinese accents, Chuan, Yue and Wu, shows that our approach outperforms traditional acoustic model reconstruction techniques, with syllable error rate reductions of 8.0%, 5.5% and 5.0%, respectively, while maintaining good performance on standard Putonghua speech.
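The abstract does not spell out how a dynamic observation density is built for a frame. As a rough illustration only, the sketch below takes one plausible reading: a state owns a pool of diagonal-covariance Gaussian components, and each frame is scored with only the `size` components that match it best. The `Component` layout, the function names, and the top-`size` selection rule are all assumptions made for this example, not the authors' algorithm. The discriminative GA search over the density sizes is not shown.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation (an assumption for this sketch):
# one diagonal-covariance Gaussian as (weight, means, variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def component_log_score(x: Sequence[float], comp: Component) -> float:
    """Return log(weight) + log N(x; means, diag(variances))."""
    weight, means, variances = comp
    log_density = 0.0
    for xi, mu, var in zip(x, means, variances):
        log_density += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
    return math.log(weight) + log_density


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def dynamic_state_log_likelihood(
    x: Sequence[float], pool: List[Component], size: int
) -> Tuple[float, List[int]]:
    """Score frame x using only the `size` best-matching components of `pool`.

    Returns the log-likelihood over the selected components and the
    indices of those components, best first.
    """
    if not 1 <= size <= len(pool):
        raise ValueError("size must be between 1 and len(pool)")
    scores = [component_log_score(x, comp) for comp in pool]
    # Rank component indices by how well they explain this particular frame.
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:size]
    return log_sum_exp([scores[i] for i in selected]), selected


# Toy one-dimensional pool; the numbers are made up for illustration.
pool = [
    (0.5, [0.0], [1.0]),
    (0.3, [3.0], [1.0]),
    (0.2, [-3.0], [1.0]),
]
log_likelihood, chosen = dynamic_state_log_likelihood([2.8], pool, size=2)
print(chosen, log_likelihood)
```

In this reading, `size` is the discrete quantity that would be tuned per state. A smaller value leaves fewer components to evaluate per frame, while a larger value brings the score closer to the full mixture likelihood.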

Source journal

IEEE Transactions on Audio Speech and Language Processing (Engineering and Technology: Engineering, Electrical and Electronic)

Self-citation rate

0.00%

Articles published

Review time

24.0 months

Journal description: The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.