A novel neural-based pronunciation modeling method for robust speech recognition

2011 IEEE Workshop on Automatic Speech Recognition & Understanding; Pub Date: 2011-12-01; DOI: 10.1109/ASRU.2011.6163985

Guangpu Huang, M. Er

Citations: 3

Abstract

This paper describes a recurrent neural network (RNN) based articulatory-phonetic inversion (API) model for improved speech recognition. A specialized optimization algorithm is introduced to enable human-like heuristic learning in an efficient, data-driven manner and to capture the dynamic nature of English speech pronunciations. The API model demonstrates superior pronunciation modeling ability and robustness against noise contamination in large-vocabulary speech recognition experiments. Using a simple rescoring formula, it improves the hidden Markov model (HMM) baseline speech recognizer, with consistent error rate reductions of 5.30% and 10.14% for phoneme recognition tasks on clean and noisy speech, respectively, on the selected TIMIT datasets. An error rate reduction of 3.35% is obtained for the SCRIBE-TIMIT word recognition tasks. The proposed system qualifies as a competitive candidate for in-depth pronunciation modeling, with intrinsic salient features such as generality and portability.
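The abstract does not state the rescoring formula itself. As a rough illustration only, one common way to rescore a recognizer's N-best list is to interpolate the baseline log-score with a second model's log-score. The sketch below assumes that form; the names `Hypothesis`, `api_log_score`, `rescore`, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    phones: List[str]
    hmm_log_score: float  # log-score from the baseline HMM recognizer
    api_log_score: float  # log-score from a pronunciation model (hypothetical)


def rescore(hyps: List[Hypothesis], weight: float = 0.5) -> List[Hypothesis]:
    """Sort hypotheses by an interpolated log-score, best first.

    Assumed form (not the paper's formula):
        score = (1 - weight) * hmm_log_score + weight * api_log_score
    """
    def combined(h: Hypothesis) -> float:
        return (1.0 - weight) * h.hmm_log_score + weight * h.api_log_score

    return sorted(hyps, key=combined, reverse=True)


# Toy N-best list with made-up scores, for illustration only.
nbest = [
    Hypothesis(["k", "ae", "t"], hmm_log_score=-10.0, api_log_score=-8.0),
    Hypothesis(["k", "ah", "t"], hmm_log_score=-9.5, api_log_score=-12.0),
]
best = rescore(nbest, weight=0.5)[0]
print(" ".join(best.phones))
```

With `weight=0.0` the ranking reduces to the baseline HMM ordering, so the weight controls how much the second model is allowed to overturn the recognizer's first choice. In practice such a weight would be tuned on held-out data.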


