Investigations on an EM-Style Optimization Algorithm for Discriminative Training of HMMs

IEEE Transactions on Audio Speech and Language Processing. Pub Date: 2013-12-01. DOI: 10.1109/TASL.2013.2280234

G. Heigold, H. Ney, R. Schlüter

Citations: 8

Abstract

Today's speech recognition systems are based on hidden Markov models (HMMs) with Gaussian mixture models whose parameters are estimated using a discriminative training criterion such as Maximum Mutual Information (MMI) or Minimum Phone Error (MPE). Currently, the optimization is almost always done with (empirical variants of) Extended Baum-Welch (EBW). This type of optimization requires sophisticated update schemes for the step sizes and a considerable amount of parameter tuning, and little is known about its convergence behavior. In this paper, we derive an EM-style algorithm for discriminative training of HMMs. Like Expectation-Maximization (EM) for the generative training of HMMs, the proposed algorithm improves the training criterion on each iteration, converges to a local optimum, and is completely parameter-free. We investigate the feasibility of the proposed EM-style algorithm for discriminative training on two tasks, namely grapheme-to-phoneme conversion and spoken digit string recognition.
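For intuition about the EM properties the abstract appeals to (improvement on every iteration, no step size to tune), the following is a minimal sketch of standard generative EM for a one-dimensional, two-component Gaussian mixture. It is not the discriminative algorithm derived in the paper. The function names (`em_step`, `run_em`, `make_data`), the synthetic data, and the initial parameters are all assumptions made for illustration.

```python
import math
import random


def component_densities(x, weights, means, variances):
    """Weighted Gaussian density of x under each mixture component."""
    return [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]


def log_likelihood(xs, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(component_densities(x, weights, means, variances)))
        for x in xs
    )


def em_step(xs, weights, means, variances):
    """One EM iteration: posteriors (E-step), closed-form updates (M-step)."""
    k = len(weights)
    # E-step: posterior probability of each component given each sample.
    posts = []
    for x in xs:
        joint = component_densities(x, weights, means, variances)
        norm = sum(joint)
        posts.append([j / norm for j in joint])
    # M-step: re-estimate parameters from the soft counts.
    # Note that no learning rate or step size appears anywhere.
    occ = [sum(p[c] for p in posts) for c in range(k)]
    new_weights = [o / len(xs) for o in occ]
    new_means = [
        sum(p[c] * x for p, x in zip(posts, xs)) / occ[c] for c in range(k)
    ]
    new_variances = [
        sum(p[c] * (x - new_means[c]) ** 2 for p, x in zip(posts, xs)) / occ[c]
        for c in range(k)
    ]
    return new_weights, new_means, new_variances


def run_em(xs, iterations=20):
    """Run EM from a fixed (hypothetical) starting point; return the
    log-likelihood before training and after each iteration."""
    weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
    history = [log_likelihood(xs, weights, means, variances)]
    for _ in range(iterations):
        weights, means, variances = em_step(xs, weights, means, variances)
        history.append(log_likelihood(xs, weights, means, variances))
    return history


def make_data(n=200, seed=0):
    """Synthetic samples alternating between N(-2, 1) and N(3, 1)."""
    rng = random.Random(seed)
    return [
        rng.gauss(-2.0, 1.0) if i % 2 == 0 else rng.gauss(3.0, 1.0)
        for i in range(n)
    ]
```

Calling `run_em(make_data())` returns a log-likelihood sequence that should be non-decreasing up to floating-point rounding. That monotone behavior is the property the abstract says carries over to the proposed discriminative algorithm, in contrast to EBW, which needs tuned step sizes.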




Source journal

IEEE Transactions on Audio Speech and Language Processing (Engineering and Technology: Electrical and Electronic Engineering)

Self-citation rate

0.00%

Articles published

Review time

24.0 months

Journal description: The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.