
Latest publications: 2009 IEEE Workshop on Automatic Speech Recognition & Understanding

Lattice-based lexical cues for word fragment detection in conversational speech
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373419
Kartik Audhkhasi, P. Georgiou, Shrikanth S. Narayanan
Previous approaches to the problem of word fragment detection in speech have focussed primarily on acoustic-prosodic features [1], [2]. This paper proposes that the output of a continuous Automatic Speech Recognition (ASR) system can also be used to derive robust lexical features for the task. We hypothesize that the confusion in the word lattice generated by the ASR system can be exploited for detecting word fragments. Two sets of lexical features are proposed: one based on the word confusion, and the other based on the pronunciation confusion between the word hypotheses in the lattice. Classification experiments with a Support Vector Machine (SVM) classifier show that these lexical features perform better than the previously proposed acoustic-prosodic features by around 5.20% (relative) on a corpus chosen from the DARPA Transtac Iraqi-English (San Diego) corpus [3]. A combination of both these feature sets improves the word fragment detection accuracy by 11.50% relative to using just the acoustic-prosodic features.
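One natural lexical cue of the kind the abstract describes can be sketched in a few lines. This is only an illustration of exploiting lattice confusion, not the paper's actual feature set; `confusion_entropy` and the posterior values are hypothetical:

```python
import math

def confusion_entropy(posteriors):
    """Entropy (bits) of the posterior distribution over competing
    word hypotheses in one lattice region: high entropy means the
    recognizer is confused, a plausible cue for a word fragment."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0)

# A region dominated by one hypothesis vs. one with four equally
# likely competitors.
low = confusion_entropy([0.9, 0.05, 0.05])
high = confusion_entropy([0.25, 0.25, 0.25, 0.25])
```

Features of this kind could then be fed, alongside acoustic-prosodic ones, to an SVM classifier as in the paper.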
Citations: 1
Back-off action selection in summary space-based POMDP dialogue systems
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373416
Milica Gasic, F. Lefèvre, Filip Jurcícek, Simon Keizer, François Mairesse, Blaise Thomson, Kai Yu, S. Young
This paper deals with the issue of invalid state-action pairs in the Partially Observable Markov Decision Process (POMDP) framework, with a focus on real-world tasks where the need for approximate solutions exacerbates this problem. In particular, when modelling dialogue as a POMDP, both the state and the action space must be reduced to smaller scale summary spaces in order to make learning tractable. However, since not all actions are valid in all states, the action proposed by the policy in summary space sometimes leads to an invalid action when mapped back to master space. Some form of back-off scheme must then be used to generate an alternative action. This paper demonstrates how the value function derived during reinforcement learning can be used to order back-off actions in an N-best list. Compared to a simple baseline back-off strategy and to a strategy that extends the summary space to minimise the occurrence of invalid actions, the proposed N-best action selection scheme is shown to be significantly more robust.
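The N-best back-off idea can be caricatured as follows; this is a minimal sketch under invented names (`q_values`, `is_valid_in_master`), not the paper's implementation:

```python
def backoff_action(q_values, is_valid_in_master):
    """N-best back-off: rank summary-space actions by their learned
    value function and return the highest-ranked action whose mapping
    back to master space is valid."""
    for action in sorted(q_values, key=q_values.get, reverse=True):
        if is_valid_in_master(action):
            return action
    raise RuntimeError("no valid back-off action")

# "confirm" has the best value but is invalid in the current master
# state, so the policy backs off to the next-best action, "offer".
q = {"confirm": 0.9, "offer": 0.7, "request": 0.4}
choice = backoff_action(q, lambda a: a != "confirm")
```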
Citations: 15
Robust speech recognition using a Small Power Boosting algorithm
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373230
Chanwoo Kim, Kshitiz Kumar, R. Stern
In this paper, we present a noise robustness algorithm called Small Power Boosting (SPB). We observe that in the spectral domain, time-frequency bins with smaller power are more affected by additive noise. The conventional way of handling this problem is estimating the noise from the test utterance and doing normalization or subtraction. In our work, in contrast, we intentionally boost the power of time-frequency bins with small energy for both the training and testing datasets. Since time-frequency bins with small power no longer exist after this power boosting, the spectral distortion between the clean and corrupt test sets becomes reduced. This type of small power boosting is also highly related to physiological nonlinearity. We observe that when small power boosting is done, suitable weighting smoothing becomes highly important. Our experimental results indicate that this simple idea is very helpful for very difficult noisy environments such as corruption by background music.
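One simplified reading of the boosting step, offered only as a sketch (the paper's exact boosting rule and its weighting smoothing may differ):

```python
def small_power_boost(power_bins, floor_ratio=0.25):
    """Floor every time-frequency power bin at floor_ratio times the
    utterance's peak power, so that bins with very small power (the
    ones most affected by additive noise) no longer occur."""
    floor = floor_ratio * max(power_bins)
    return [max(p, floor) for p in power_bins]

# The tiny 0.5 bin is boosted up to the floor of 0.25 * 100 = 25.
boosted = small_power_boost([100.0, 0.5, 64.0], floor_ratio=0.25)
```

Applying the same boosting to both training and test data, as the abstract describes, is what reduces the spectral mismatch between clean and corrupted sets.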
Citations: 23
Dynamic network decoding revisited
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372904
H. Soltau, G. Saon
We present a dynamic network decoder capable of using large cross-word context models and large n-gram histories. Our method for constructing the search network is designed to process large cross-word context models very efficiently and we address the optimization of the search network to minimize any overhead during run-time for the dynamic network decoder. The search procedure uses the full LM history for lookahead, and path recombination is done as early as possible. In our systematic comparison to a static FSM based decoder, we find the dynamic decoder can run at comparable speed as the static decoder when large language models are used, while the static decoder performs best for small language models. We discuss the use of very large vocabularies of up to 2.5 million words for both decoding approaches and analyze the effect of weak acoustic models for pruning.
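The full-history LM look-ahead mentioned above can be reduced to its core in a toy form (a hypothetical sketch, not the decoder's implementation): the anticipated LM contribution at a search-network node is the best score over the words still reachable from it.

```python
def lm_lookahead(reachable_words, lm_score):
    """Anticipated LM score at a network node: the best (maximum
    log-probability) over all words reachable from this node, with
    the full LM history assumed to be baked into lm_score."""
    return max(lm_score(w) for w in reachable_words)

# Toy unigram-style log-probabilities for illustration.
scores = {"the": -1.0, "a": -1.5, "zygote": -9.0}
best = lm_lookahead(["a", "zygote"], scores.get)
```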
Citations: 50
Support vector machines for noise robust ASR
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372913
M. Gales, A. Ragni, H. AlDamarki, C. Gautier
Using discriminative classifiers, such as Support Vector Machines (SVMs), in combination with, or as an alternative to, Hidden Markov Models (HMMs) has a number of advantages for difficult speech recognition tasks. For example, the models can make use of dependencies in the observation sequences beyond those captured by HMMs, provided the appropriate form of kernel is used. However, standard SVMs are binary classifiers, and speech is a multi-class problem. Furthermore, training SVMs to distinguish word pairs requires that each word appears in the training data. This paper examines both of these limitations. Tree-based reduction approaches for multiclass classification are described, as well as some of the issues in applying them to dynamic data, such as speech. To address the training data issues, a simplified version of HMM-based synthesis can be used, which allows data for any word-pair to be generated. These approaches are evaluated on two noise corrupted digit sequence tasks: AURORA 2.0, and actual in-car collected data.
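The tree-based reduction can be illustrated with a toy walk over the class set; the stand-in `decide_left` function is hypothetical (a trained binary classifier in the paper), and the point is only that a balanced tree needs on the order of K binary decisions rather than one classifier per word pair:

```python
def tree_classify(classes, decide_left):
    """Walk a balanced binary tree over the class set: at each node a
    binary classifier (decide_left) picks the half that contains the
    answer; recurse until one class remains."""
    while len(classes) > 1:
        mid = len(classes) // 2
        left, right = classes[:mid], classes[mid:]
        classes = left if decide_left(left, right) else right
    return classes[0]

# Stand-in for a trained binary classifier: it "knows" the truth.
digits = [str(d) for d in range(10)]
picked = tree_classify(digits, lambda left, right: "7" in left)
```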
Citations: 31
Improved vocabulary independent search with approximate match based on Conditional Random Fields
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373323
U. Chaudhari, M. Picheny
We investigate the use of Conditional Random Fields (CRF) to model confusions and account for errors in the phonetic decoding derived from Automatic Speech Recognition output. The goal is to improve the accuracy of approximate phonetic match, given query terms and an indexed database of documents, in a vocabulary independent audio search system. Audio data is ingested, segmented, decoded to produce a sequence of phones, and subsequently indexed using phone N-grams. Search is performed by expanding queries into phone sequences and matching against the index. The approximate match score is derived from a CRF, trained on parallel transcripts, which provides a general framework for modeling the errors that a recognition system may make taking contextual effects into consideration. Our approach differs from other work in the field in that we focus on using CRFs to model context dependent phone level confusions, rather than on explicitly modeling parameters of an edit distance. While, the results we obtain on both in and out of vocabulary (OOV) search tasks improve on previous work which incorporated high order phone confusions, the gains for OOV are more impressive.
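The phone N-gram indexing step described above is straightforward to sketch (illustrative only; the paper's CRF then scores approximate matches against such an index):

```python
def phone_ngrams(phones, n=3):
    """All phone n-grams of a decoded phone sequence; the audio index
    maps such n-grams to the documents and positions where they occur,
    and expanded query phone strings are matched against it."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

grams = phone_ngrams(["k", "ae", "t", "s"])
```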
Citations: 6
MLP based hierarchical system for task adaptation in ASR
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373383
Joel Pinto, M. Magimai.-Doss, H. Bourlard
We investigate a multilayer perceptron (MLP) based hierarchical approach for task adaptation in automatic speech recognition. The system consists of two MLP classifiers in tandem. A well-trained MLP available off-the-shelf is used at the first stage of the hierarchy. A second MLP is trained on the posterior features estimated by the first, but with a long temporal context of around 130 ms. By using an MLP trained on 232 hours of conversational telephone speech, the hierarchical adaptation approach yields a word error rate of 1.8% on the 600-word Phonebook isolated word recognition task. This compares favorably to the error rate of 4% obtained by the conventional single MLP based system trained with the same amount of Phonebook data that is used for adaptation. The proposed adaptation scheme also benefits from the ability of the second MLP to model the temporal information in the posterior features.
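The long temporal context fed to the second MLP amounts to stacking neighbouring posterior vectors; a minimal sketch, with edge handling (repetition padding) chosen by us rather than taken from the paper:

```python
def stack_context(frames, context):
    """Input for the second MLP: each frame's posterior vector
    concatenated with its +/- `context` neighbours (edge frames padded
    by repetition), approximating the paper's ~130 ms window over
    posterior features."""
    padded = [frames[0]] * context + list(frames) + [frames[-1]] * context
    return [sum((padded[i + d] for d in range(2 * context + 1)), [])
            for i in range(len(frames))]

# Three 1-dimensional posterior frames, one frame of context each side.
stacked = stack_context([[0.1], [0.2], [0.3]], context=1)
```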
Citations: 15
Scaling shrinkage-based language models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373380
Stanley F. Chen, L. Mangu, B. Ramabhadran, R. Sarikaya, A. Sethy
In [1], we show that a novel class-based language model, Model M, and the method of regularized minimum discrimination information (rMDI) models outperform comparable methods on moderate amounts of Wall Street Journal data. Both of these methods are motivated by the observation that shrinking the sum of parameter magnitudes in an exponential language model tends to improve performance [2]. In this paper, we investigate whether these shrinkage-based techniques also perform well on larger training sets and on other domains. First, we explain why good performance on large data sets is uncertain, by showing that gains relative to a baseline n-gram model tend to decrease as training set size increases. Next, we evaluate several methods for data/model combination with Model M and rMDI models on limited-scale domains, to uncover which techniques should work best on large domains. Finally, we apply these methods on a variety of medium-to-large-scale domains covering several languages, and show that Model M consistently provides significant gains over existing language models for state-of-the-art systems in both speech recognition and machine translation.
Citations: 41
Weighted finite state transducer based statistical dialog management
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373350
Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, H. Kashioka, Satoshi Nakamura
We proposed a dialog system using a weighted finite-state transducer (WFST) in which user concept and system action tags are input and output of the transducer, respectively. The WFST-based platform for dialog management enables us to combine various statistical models for dialog management (DM), user input understanding and system action generation, and then search the best system action in response to user inputs among multiple hypotheses. To test the potential of the WFST-based DM platform using statistical models, we constructed a dialog system using a human-to-human spoken dialog corpus for hotel reservation, which is annotated with Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus and then composed together and optimized. We evaluated the detection accuracy of the system next action tags using Mean Reciprocal Ranking (MRR). Finally, we constructed a full WFST-based dialog system by composing SLU, scenario and sentence generation (SG) WFSTs. Humans read the system responses in natural language and judged the quality of the responses. We confirmed that the WFST-based DM platform was capable of handling various spoken language and scenarios when the user concept and system action tags are consistent and distinguishable.
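The SLU-scenario chaining can be mimicked with a toy composition of single-step transducers; the dict representation and all symbols below are invented for illustration (a real system would use a WFST toolkit), but the cost addition mirrors how composition combines tropical-semiring weights:

```python
def compose(t1, t2):
    """Toy composition of two single-step transducers, each given as
    {input: (output, cost)}: chains SLU (user concept -> dialog act)
    with the scenario model (dialog act -> system action), adding
    costs as WFST composition adds tropical-semiring weights."""
    composed = {}
    for a, (b, w1) in t1.items():
        if b in t2:
            c, w2 = t2[b]
            composed[a] = (c, w1 + w2)
    return composed

slu = {"want_room": ("REQUEST_ROOM", 1.0)}
scenario = {"REQUEST_ROOM": ("offer_room_options", 2.0)}
pipeline = compose(slu, scenario)
```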
Citations: 14
Generalized likelihood ratio discriminant analysis
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373395
Muhammad Ali Tahir, G. Heigold, Christian Plahl, R. Schlüter, H. Ney
In the past several decades, classifier-independent front-end feature extraction, where the derivation of acoustic features is lightly associated with the back-end model training or classification, has been prominently used in various pattern recognition tasks, including automatic speech recognition (ASR). In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test (LRT). It attempts to seek a lower dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible without the homoscedastic assumption on class distributions. We also show that the classical linear discriminant analysis (LDA) and its well-known extension - heteroscedastic linear discriminant analysis (HLDA) can be regarded as two special cases of our proposed method. The empirical class confusion information can be further incorporated into GLRDA for better recognition performance. Experimental results demonstrate that GLRDA and its variant can yield moderate performance improvements over HLDA and LDA for the large vocabulary continuous speech recognition (LVCSR) task.
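For reference, the classical likelihood ratio test statistic that the method builds on (GLRDA's actual objective refines this for the projection search and is not reproduced here):

```latex
\lambda(\mathbf{X}) \;=\;
\frac{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{X})}
     {\sup_{\theta \in \Theta} L(\theta \mid \mathbf{X})},
\qquad 0 \le \lambda \le 1,
```

where $\Theta_0$ encodes the null hypothesis (here, the most confusing situation between classes) and a small $\lambda$ means the data argue against it; GLRDA seeks the lower-dimensional projection that makes $H_0$ as unlikely as possible.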
Citations: 1
Journal: 2009 IEEE Workshop on Automatic Speech Recognition & Understanding