
Latest publications from the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Randomized maximum entropy language models
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163935
Puyang Xu, S. Khudanpur, A. Gunawardana
We address the memory problem of maximum entropy language models (MELM) with very large feature sets. Randomized techniques are employed to remove all large, exact data structures in MELM implementations. To avoid the dictionary structure that maps each feature to its corresponding weight, the feature hashing trick [1] [2] can be used. We also replace the explicit storage of features with a Bloom filter. We show with extensive experiments that false positive errors of Bloom filters and random hash collisions do not degrade model performance. Both perplexity and WER improvements are demonstrated by building MELM that would otherwise be prohibitively large to estimate or store.
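As a rough illustration of the two randomized structures described in this abstract, here is a minimal Python sketch assuming MD5-based hash functions; the table sizes and hash choices (NUM_BUCKETS, BLOOM_BITS, NUM_HASHES) are placeholders, not the paper's configuration.

```python
import hashlib

NUM_BUCKETS = 1 << 22   # hashed weight table; replaces the feature-to-weight dictionary
BLOOM_BITS = 1 << 24    # Bloom filter size in bits; replaces explicit feature storage
NUM_HASHES = 3          # number of Bloom filter hash functions

weights = [0.0] * NUM_BUCKETS
bloom = bytearray(BLOOM_BITS // 8)

def _hash(feature: str, seed: int) -> int:
    # Any well-mixed hash works; MD5 is used here only for illustration.
    return int(hashlib.md5(f"{seed}:{feature}".encode()).hexdigest(), 16)

def bloom_add(feature: str) -> None:
    """Record a training feature in the approximate membership set."""
    for seed in range(NUM_HASHES):
        bit = _hash(feature, seed) % BLOOM_BITS
        bloom[bit >> 3] |= 1 << (bit & 7)

def bloom_contains(feature: str) -> bool:
    """May return a false positive, never a false negative."""
    for seed in range(NUM_HASHES):
        bit = _hash(feature, seed) % BLOOM_BITS
        if not bloom[bit >> 3] & (1 << (bit & 7)):
            return False
    return True

def feature_weight(feature: str) -> float:
    """Look up a hashed weight, gated by the Bloom filter; hash collisions are tolerated."""
    if not bloom_contains(feature):
        return 0.0
    return weights[_hash(feature, 101) % NUM_BUCKETS]
```

In use, bloom_add would be called for every feature observed during training, and feature_weight when scoring a hypothesis, so neither the feature strings nor an exact feature-to-index map ever needs to be stored.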
Citations: 5
Efficient discriminative training of long-span language models
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163933
A. Rastrow, Mark Dredze, S. Khudanpur
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models, both during decoding and during discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill-climbing rescoring algorithm [1]. We empirically demonstrate significant computational savings as well as error-rate reductions over N-best training methods in a state-of-the-art ASR system for Broadcast News transcription.
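The hill-climbing rescoring loop that the training procedure builds on [1] can be sketched schematically as follows; the neighbourhood generator and the combined acoustic-plus-long-span-LM scorer are placeholders, and the actual algorithm operates over lattice positions rather than plain word lists.

```python
from typing import Callable, Iterable, List

def hill_climb_rescore(
    hypothesis: List[str],
    neighbours: Callable[[List[str], int], Iterable[List[str]]],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedily apply single-position edits (e.g. substitutions, insertions,
    deletions drawn from the lattice) as long as they improve the score."""
    improved = True
    while improved:
        improved = False
        best_score = score(hypothesis)
        for position in range(len(hypothesis) + 1):
            for candidate in neighbours(hypothesis, position):
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    hypothesis, best_score, improved = candidate, candidate_score, True
            if improved:
                break  # restart the position sweep from the updated hypothesis
    return hypothesis
```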
Citations: 5
On-line policy optimisation of spoken dialogue systems via live interaction with human subjects
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163950
Milica Gasic, Filip Jurcícek, Blaise Thomson, Kai Yu, S. Young
Statistical dialogue models have required a large number of dialogues to optimise the dialogue policy, relying on the use of a simulated user. This results in a mismatch between training and live conditions, as well as significant development costs for the simulator, thereby mitigating many of the claimed benefits of such models. Recent work on Gaussian process reinforcement learning has shown that learning can be substantially accelerated. This paper reports on an experiment to learn a policy for a real-world task directly from human interaction, using rewards provided by users. It shows that a usable policy can be learnt in just a few hundred dialogues without needing a user simulator, using a learning strategy that reduces the risk of taking bad actions. The paper also investigates adaptation behaviour when the system continues learning for several thousand dialogues and highlights the need for robustness to noisy rewards.
Citations: 83
Multi-level context-dependent acoustic modeling for automatic speech recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163911
Hung-An Chang, James R. Glass
In this paper, we propose a multi-level, context-dependent acoustic modeling framework for automatic speech recognition. For each context-dependent unit considered by the recognizer, we construct a set of classifiers that target different amounts of contextual resolution, and then combine them for scoring. Since information from multiple levels of contexts is appropriately combined, the proposed modeling framework provides reasonable scores for units with few or no training examples, while maintaining an ability to distinguish between different context-dependent units. On a large vocabulary lecture transcription task, the proposed modeling framework outperforms a traditional clustering-based context-dependent acoustic model by 3.5% (11.4% relative) in terms of word error rate.
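A minimal sketch of the score-combination idea, assuming a simple fixed weighting over context resolutions (the combination scheme actually used in the paper may differ):

```python
def combine_context_levels(level_scores, level_weights):
    """Combine classifier log-scores computed at several context resolutions
    (e.g. context-independent phone, clustered context, full triphone).
    Levels without a trained classifier (no training examples) are skipped,
    so rare context-dependent units still receive a usable score from the
    coarser levels."""
    total, weight_sum = 0.0, 0.0
    for level, score in level_scores.items():
        if score is None:                  # no classifier at this resolution
            continue
        total += level_weights[level] * score
        weight_sum += level_weights[level]
    return total / weight_sum if weight_sum else float("-inf")
```

For example, combine_context_levels({"mono": -4.1, "triphone": None}, {"mono": 0.3, "triphone": 0.7}) falls back entirely on the context-independent score when the triphone classifier is missing.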
Citations: 3
A convergence analysis of log-linear training and its application to speech recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163895
Simon Wiesler, R. Schlüter, H. Ney
Log-linear models are a promising approach for speech recognition. Typically, log-linear models are trained according to a strictly convex criterion. Optimization algorithms are guaranteed to converge to the unique global optimum of the objective function from any initialization. For large-scale applications, considerations in the limit of infinite iterations are not sufficient. We show that log-linear training can be a highly ill-conditioned optimization problem, resulting in extremely slow convergence. Conversely, the optimization problem can be preconditioned by feature transformations. Making use of our convergence analysis, we improve our log-linear speech recognition system and achieve a strong reduction of its training time. In addition, we validate our analysis on a continuous handwriting recognition task.
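Feature whitening is one standard example of a preconditioning feature transformation; whether it corresponds to the transforms used in the paper is an assumption. A minimal sketch:

```python
import numpy as np

def whiten_features(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Decorrelate and rescale the features so that the training objective's
    Hessian is better conditioned, which speeds up gradient-based
    optimisation of log-linear models."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform
```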
Citations: 15
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163899
F. Seide, Gang Li, Xie Chen, Dong Yu
We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
Citations: 690
A Trajectory-based Parallel Model Combination with a unified static and dynamic parameter compensation for noisy speech recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163914
K. Sim, Minh-Thang Luong
Parallel Model Combination (PMC) is widely used as a technique to compensate the Gaussian parameters of a clean speech model for noisy speech recognition. The basic principle of PMC uses a log-normal approximation to transform statistics of the data distribution between the cepstral domain and the linear spectral domain. Typically, further approximations are needed to compensate the dynamic parameters separately. In this paper, Trajectory PMC (TPMC) is proposed to compensate both the static and dynamic parameters. TPMC uses the explicit relationships between the static and dynamic features to transform the static and dynamic parameters into a sequence (trajectory) of static parameters, so that the log-normal approximation can be applied. Experimental results on the WSJCAM0 database corrupted with additive babble noise reveal that the proposed TPMC method gives promising improvements over PMC and VTS.
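The log-normal approximation at the core of PMC can be written out for a single log-spectral dimension with additive noise; this is the standard textbook form, with the trajectory extension and the cepstral-to-log-spectral transforms omitted.

```python
import numpy as np

def pmc_lognormal_combine(mu_x, var_x, mu_n, var_n):
    """Map clean-speech and noise log-spectral Gaussian moments to the linear
    spectral domain, add them (additive noise), and map the result back under
    a log-normal assumption."""
    # log-spectral Gaussian -> linear-spectral mean and variance
    lin_mu_x = np.exp(mu_x + var_x / 2.0)
    lin_var_x = lin_mu_x ** 2 * (np.exp(var_x) - 1.0)
    lin_mu_n = np.exp(mu_n + var_n / 2.0)
    lin_var_n = lin_mu_n ** 2 * (np.exp(var_n) - 1.0)
    # moments of independent variables add in the linear spectral domain
    lin_mu_y = lin_mu_x + lin_mu_n
    lin_var_y = lin_var_x + lin_var_n
    # back to log-spectral moments of the corrupted speech
    var_y = np.log(lin_var_y / lin_mu_y ** 2 + 1.0)
    mu_y = np.log(lin_mu_y) - var_y / 2.0
    return mu_y, var_y
```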
Citations: 6
iVector-based discriminative adaptation for automatic speech recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163922
M. Karafiát, L. Burget, P. Matejka, O. Glembek, J. Černocký
We present a novel technique for discriminative feature-level adaptation of an automatic speech recognition system. The concept of iVectors, popular in speaker recognition, is used to extract information about the speaker or acoustic environment from a speech segment. An iVector is a low-dimensional, fixed-length representation of such information. To utilize iVectors for adaptation, Region Dependent Linear Transforms (RDLT) are discriminatively trained using the MPE criterion on a large amount of annotated data to extract the relevant information from the iVectors and to compensate the speech features. The approach was tested on standard CTS data. We found it to be complementary to common adaptation techniques. On a well-tuned RDLT system with standard CMLLR adaptation we reached a 0.8% additive absolute WER improvement.
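A schematic of region-dependent feature compensation with an appended iVector follows; the shapes and the exact way the iVector enters each transform are assumptions made for illustration (in the paper the transforms themselves are trained discriminatively with the MPE criterion).

```python
import numpy as np

def rdlt_compensate(frame, ivector, region_posteriors, A, b):
    """Compensate one acoustic frame as a posterior-weighted sum of affine
    region-dependent transforms applied to the frame augmented with the
    utterance-level iVector."""
    augmented = np.concatenate([frame, ivector])
    compensated = np.zeros_like(b[0], dtype=float)
    for r, gamma in enumerate(region_posteriors):
        compensated = compensated + gamma * (A[r] @ augmented + b[r])
    return compensated
```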
Citations: 92
Extending noise robust structured support vector machines to larger vocabulary tasks
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163898
Shi-Xiong Zhang, M. Gales
This paper describes a structured SVM framework suitable for noise-robust medium/large vocabulary speech recognition. Several theoretical and practical extensions to previous work on small vocabulary tasks are detailed. The joint feature space based on word models is extended to allow context-dependent triphone models to be used. Interpreting the structured SVM as a large-margin log-linear model illustrates that there is an implicit assumption that the prior of the discriminative parameters is a zero-mean Gaussian. However, depending on the definition of the likelihood feature space, a non-zero prior may be more appropriate. A general Gaussian prior is incorporated into the large-margin training criterion in a form that allows the cutting plane algorithm to be directly applied. To further speed up the training process, a 1-slack algorithm, caching of competing hypotheses, and parallelization strategies are also proposed. The performance of structured SVMs is evaluated on a noise-corrupted medium vocabulary speech recognition task: AURORA 4.
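In schematic form (our notation, not necessarily the paper's), the large-margin criterion with a general Gaussian prior N(w; mu, Sigma) replacing the usual zero-mean regulariser looks like:

```latex
\min_{\mathbf{w}} \;
  \tfrac{1}{2}\,(\mathbf{w}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})
  \;+\; C \sum_{i} \max_{\mathbf{y} \neq \mathbf{y}_i}
  \Big[ \mathcal{L}(\mathbf{y}, \mathbf{y}_i)
        - \mathbf{w}^{\top}\big(\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i)
        - \boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y})\big) \Big]_{+}
```

Setting mu to zero and Sigma to the identity recovers the familiar structured-SVM objective, which is what makes the cutting plane machinery directly applicable.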
Citations: 7
Subword-based automatic lexicon learning for Speech Recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163938
Timo Mertens, S. Seneff
We present a framework for learning a pronunciation lexicon for an Automatic Speech Recognition (ASR) system from multiple utterances of the same training words, where the lexical identities of the words are unknown. Instead of only trying to learn pronunciations for known words, we go one step further and try to learn both spelling and pronunciation in a joint optimization. Decoding based on linguistically motivated hybrid subword units generates the joint lexical search space, which is reduced to the most appropriate lexical entries by a set of simple pruning techniques. A cascade of letter and acoustic pruning, followed by re-scoring of N-best hypotheses with discriminative decoder statistics, results in optimal lexical entries in terms of both spelling and pronunciation. Evaluating the framework on English isolated word recognition, we achieve reductions of 7.7% absolute on word error rate and 20.9% absolute on character error rate over baselines that use no pruning.
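The pruning cascade can be sketched as follows; the scoring callables, thresholds, and N-best size are placeholders rather than values from the paper.

```python
def select_lexical_entries(candidates, letter_score, acoustic_score, rescore,
                           letter_threshold=0.0, acoustic_threshold=0.0, n_best=10):
    """Letter-level pruning, then acoustic pruning, then discriminative
    re-scoring of the surviving spelling/pronunciation hypotheses; the
    highest-scoring entries become the learnt lexical entries."""
    survivors = [c for c in candidates if letter_score(c) > letter_threshold]
    survivors = [c for c in survivors if acoustic_score(c) > acoustic_threshold]
    survivors.sort(key=rescore, reverse=True)
    return survivors[:n_best]
```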
Citations: 1