
2009 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Improved decision trees for multi-stream HMM-based audio-visual continuous speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373454
Jing Huang, Karthik Venkat Ramanan
HMM-based audio-visual speech recognition (AVSR) systems have shown success in continuous speech recognition by combining visual and audio information, especially in noisy environments. In this paper we study how to improve the decision trees used to create context classes in HMM-based AVSR systems. Traditionally, visual models have been trained with the same context classes as the audio-only models. In this paper we investigate the use of separate decision trees to model the context classes for the audio and visual streams independently. Additionally, we investigate the use of viseme classes in decision tree building for the visual stream. In experiments on a 37-speaker, 1.5-hour test set (about 12,000 words) of continuous digits in noise, we obtain about a 3% absolute (20% relative) gain in AVSR performance by using separate decision trees for the audio and visual streams, with viseme classes used in decision tree building for the visual stream.
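As a rough illustration of the tree-building step described above, here is a minimal sketch of choosing the best splitting question from single-Gaussian sufficient statistics; for the visual stream the question set would contain viseme-class questions rather than purely phonetic ones. The statistics, the phone sets, and the question names are illustrative assumptions, not the authors' code.

```python
import math

def gauss_ll(n, s, ss):
    """Log-likelihood of n 1-D points under their own ML Gaussian."""
    if n < 2:
        return 0.0
    var = max(ss / n - (s / n) ** 2, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def pool(stats, contexts):
    """Pool (count, sum, sum-of-squares) statistics over a set of contexts."""
    n = sum(stats[c][0] for c in contexts)
    s = sum(stats[c][1] for c in contexts)
    ss = sum(stats[c][2] for c in contexts)
    return n, s, ss

def best_question(stats, questions):
    """One greedy step of tree building: question with largest likelihood gain."""
    parent = gauss_ll(*pool(stats, list(stats)))
    best, best_gain = None, 0.0
    for name, q in questions.items():
        yes = [c for c in stats if q(c)]
        no = [c for c in stats if not q(c)]
        if not yes or not no:
            continue
        gain = gauss_ll(*pool(stats, yes)) + gauss_ll(*pool(stats, no)) - parent
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain

# Toy stats per (left-context, right-context); the visual tree asks
# viseme-class questions such as "is the left context a bilabial viseme?".
stats = {("m", "ah"): (40, 8.0, 9.0), ("b", "ah"): (35, 7.5, 8.8),
         ("s", "ah"): (50, -2.0, 30.0)}
questions = {"L=bilabial_viseme": lambda c: c[0] in {"p", "b", "m"},
             "L=fricative": lambda c: c[0] in {"s", "z", "f", "v"}}
print(best_question(stats, questions))
```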
Citations: 3
Constrained discriminative training of N-gram language models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373338
A. Rastrow, A. Sethy, B. Ramabhadran
In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language-specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition system; however, discriminative training of N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs: updating the back-off probabilities of unseen events, normalizing the N-gram updates to ensure a valid probability distribution, and imposing a relative-entropy-based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and MIT lecture corpora. A modest improvement of 0.2% absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems.
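To make the constraints concrete, here is a hedged sketch of a single update step showing two of them in isolation: renormalising the updated N-gram distribution and applying a relative-entropy budget against the baseline model. The multiplicative update form, step size, and budget value are assumptions for illustration, not the authors' exact procedure.

```python
import math

def discriminative_update(p_base, grad, step=0.1, kl_budget=0.01):
    """p_base: word -> prob for one history; grad: word -> gradient sign."""
    # Multiplicative update followed by the normalization constraint.
    p_new = {w: max(p_base[w] * math.exp(step * grad.get(w, 0.0)), 1e-12)
             for w in p_base}
    z = sum(p_new.values())
    p_new = {w: p / z for w, p in p_new.items()}
    # Relative-entropy global constraint: reject updates that drift too far.
    kl = sum(p_base[w] * math.log(p_base[w] / p_new[w]) for w in p_base)
    return p_new if kl <= kl_budget else p_base

p = {"the": 0.5, "a": 0.3, "an": 0.2}
print(discriminative_update(p, {"the": +1.0, "a": -1.0}))
```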
Citations: 8
Automatic punctuation generation for speech
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373365
Wenzhu Shen, Roger Peng Yu, F. Seide, Ji Wu
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a Maximum A-Posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from Automatic Speech Recognition (ASR). The system consists of an “acoustic model” (AM) for prosodic features (in practice, pause duration) and a “language model” (LM) for text-only features. The LM combines three components: an MLP-based trigger-word model and forward and backward trigram punctuation predictors. The separation into acoustic and language models allows them to be learned on different corpora; in particular, the LM can be trained on large amounts of text for which no acoustic information is available. We find that the trigger-word LM is very useful, and further improvement can be achieved by combining prosodic and lexical information. We achieve an F-measure of 81.0% and 56.5% for voicemails and podcasts, respectively, on reference transcripts, and 69.6% for voicemails on ASR transcripts.
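A minimal sketch of the MAP combination described in the abstract: at each word boundary, the punctuation mark maximising the sum of an acoustic pause-duration score and a language-model score is chosen. The Gaussian pause model and the toy LM scores below are illustrative assumptions, not the paper's actual models.

```python
import math

# Hypothetical mean/std of pause duration (seconds) for each mark.
PAUSE_MODEL = {"": (0.05, 0.05), ",": (0.20, 0.10), ".": (0.50, 0.20)}

def am_score(pause, mark):
    """Gaussian log-density of the observed pause under the mark's model."""
    mu, sigma = PAUSE_MODEL[mark]
    return -0.5 * ((pause - mu) / sigma) ** 2 - math.log(sigma)

def map_punct(pause, lm_scores, am_weight=1.0):
    """lm_scores: mark -> log P(mark | surrounding words)."""
    return max(lm_scores,
               key=lambda m: am_weight * am_score(pause, m) + lm_scores[m])

# e.g. a 0.25 s pause with an LM mildly preferring a comma:
print(map_punct(0.25, {"": -2.0, ",": -1.0, ".": -1.5}))
```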
Citations: 5
Robust Speaker Diarization for short speech recordings
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373254
David Imseng, G. Friedland
We investigate the behavior of a state-of-the-art Speaker Diarization system on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in Speaker Diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose a novel initialization-parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the optimal duration of speech data per Gaussian against the total duration of the speech data, a relationship verified experimentally for the first time in this article. As a result, the Diarization Error Rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative.
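The following toy sketch shows the flavor of such an initialization: the number of Gaussians and clusters is derived from the amount of speech so that the seconds-of-speech-per-Gaussian ratio stays near a target value. The target and the bounds here are illustrative assumptions, not the values estimated in the paper.

```python
def init_params(speech_seconds, secs_per_gaussian=6.0,
                gaussians_per_cluster=5, min_clusters=2, max_clusters=10):
    """Scale the initial model size with the amount of available speech."""
    n_gaussians = max(1, round(speech_seconds / secs_per_gaussian))
    n_clusters = max(min_clusters,
                     min(max_clusters, n_gaussians // gaussians_per_cluster))
    return n_clusters, n_gaussians

print(init_params(100))  # short meeting -> few clusters and Gaussians
print(init_params(500))  # longer meeting -> larger initial model
```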
Citations: 30
Reinforcing language model for speech translation with auxiliary data
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373308
Jia Cui, Yonggang Deng, Bowen Zhou
Language model domain adaptation usually draws on a large quantity of auxiliary data from different genres and domains. It has mostly relied on scoring functions for selection, and it is typically independent of the intended application, such as machine translation. In this paper, we present a novel domain adaptation approach that is directly motivated by the needs of the translation engine. We first identify interesting phrases by examining phrase translation tables, and then use those phrases as anchors to select useful and relevant sentences from general-domain data, with the goal of improving domain coverage or providing additional contextual information. Experimental results on Farsi-to-English translation in the military force protection domain and Chinese-to-English translation in the travel domain show statistically significant gains from the reinforced language models over the baseline.
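A minimal sketch of the anchor-based selection step, assuming a toy phrase table: source phrases are harvested from the phrase translation table, and general-domain sentences containing any anchor phrase are kept. Real systems would additionally score and cap the selection; that is omitted here.

```python
# Toy phrase translation table: source phrase -> translation (unused here).
phrase_table = {"force protection": "...", "check point": "..."}
anchors = set(phrase_table)

def select(sentences):
    """Keep general-domain sentences that contain at least one anchor phrase."""
    return [s for s in sentences if any(a in s.lower() for a in anchors)]

general_domain = ["The patrol passed the check point at dawn.",
                  "Stock markets fell on Tuesday."]
print(select(general_domain))  # only the first sentence is kept
```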
Citations: 1
Spoken term detection from bilingual spontaneous speech using code-switched lattice-based structures for words and subword units
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372901
Hung-yi Lee, Yueh-Lien Tang, Hao Tang, Lin-Shan Lee
This paper presents the first publicly known work on spoken term detection from bilingual spontaneous speech using code-switched lattice-based structures for word and subword units. The corpus used consists of lectures, with Chinese as the host language and English as the guest language, recorded for a real course offered at National Taiwan University. The techniques reported here have been successfully implemented and tested in a real lecture system now available online over the Internet. We also present approaches using word fragments as the subword unit for English, and analyse the difficulties that arise when code-switched lattice-based structures for subword units are used for tasks involving languages of quite different natures.
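As a loose illustration of subword-based detection, the sketch below breaks a query term into fragments and scores it against an index mapping fragments to lattice posterior mass. Character n-grams stand in for the paper's word fragments, and the toy index values are assumptions for illustration only.

```python
def fragments(word, n=3):
    """Overlapping character fragments as a stand-in for word fragments."""
    return [word[i:i + n] for i in range(max(1, len(word) - n + 1))]

# Hypothetical index: fragment -> accumulated lattice posterior mass.
index = {"spe": 0.8, "pee": 0.7, "eec": 0.7, "ech": 0.9}

def detection_score(term):
    """Average posterior mass over the query's fragments."""
    frs = fragments(term)
    return sum(index.get(f, 0.0) for f in frs) / len(frs)

print(detection_score("speech"))  # high score: all fragments in the lattice
```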
Citations: 8
Discriminative Product-of-Expert acoustic mapping for cross-lingual phone recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372910
K. Sim
This paper presents a Product-of-Expert framework for probabilistic acoustic mapping in cross-lingual phone recognition. Under this framework, the posterior probabilities of the target HMM states are modelled as a weighted product of experts, where the experts or their weights are modelled as functions of the posterior probabilities of the source HMM states generated by a foreign phone recogniser. Careful choice of these functions leads to the Product-of-Posterior and Posterior Weighted Product-of-Expert models, which can be conveniently represented as 2-layer and 3-layer feed-forward neural networks, respectively. Therefore, the commonly used error back-propagation method can be used to discriminatively train the model parameters. Experimental results are presented on the NTIMIT database using Czech, Hungarian and Russian hybrid NN/HMM recognisers as the foreign phone recognisers to recognise English phones. With only about 15.6 minutes of training data, the best acoustic mapping model achieved a 46.00% phone error rate, not far behind the 43.55% of an NN/HMM system trained directly on the full 3.31 hours of data.
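Under the paper's description, the Product-of-Posterior form amounts to a softmax over weighted sums of the log source-state posteriors, i.e. a 2-layer feed-forward network trainable by back-propagation. The sketch below illustrates that forward pass; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_tgt = 5, 3                        # source and target HMM state counts
W = rng.normal(size=(n_tgt, n_src)) * 0.1  # expert weights (trainable)
b = np.zeros(n_tgt)

def map_posteriors(p_src):
    """Map source-state posteriors to target-state posteriors via a softmax
    over weighted sums of log posteriors (product of experts in log domain)."""
    z = W @ np.log(np.clip(p_src, 1e-10, 1.0)) + b
    e = np.exp(z - z.max())
    return e / e.sum()

p_src = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
print(map_posteriors(p_src))
```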
Citations: 35
A study on hidden Markov model's generalization capability for speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373359
Xiong Xiao, Jinyu Li, Chng Eng Siong, Haizhou Li, Chin-Hui Lee
In statistical learning theory, the generalization capability of a model is its ability to generalize well on unseen test data that follow the same distribution as the training data. This paper investigates how generalization capability can also improve robustness when testing and training data come from different distributions, in the context of speech recognition. Two discriminative training (DT) methods are used to train the hidden Markov model (HMM) for better generalization capability, namely the minimum classification error (MCE) and soft-margin estimation (SME) methods. Results on the Aurora-2 task show that both SME and MCE are effective in improving one measure of the acoustic model's generalization capability, i.e. the margin of the model, with SME being moderately more effective. In addition, the better generalization capability translates into better robustness of speech recognition performance, even when there is significant mismatch between the training and testing data. We also applied mean and variance normalization (MVN) to preprocess the data and reduce the training-testing mismatch. After MVN, MCE and SME perform even better, as the generalization capability is now more closely related to robustness. The best performance on Aurora-2 is obtained with SME, with about a 28% relative error rate reduction over the MVN baseline system. Finally, we also use SME to demonstrate the potential of better generalization capability for improving robustness on the more realistic noisy Aurora-3 task, where significant improvements are obtained.
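As a rough sketch of the margin notion measured here: the frame-normalised separation between the correct hypothesis score and the best competing score, with a hinge penalty for samples whose separation falls below a soft margin. The scores and the margin value rho below are illustrative numbers, not the paper's settings.

```python
def separation(score_correct, score_competitor, n_frames):
    """Frame-normalised separation between correct and competing hypotheses."""
    return (score_correct - score_competitor) / n_frames

def sme_loss(samples, rho=0.5):
    """Hinge loss over (correct score, best competitor score, #frames) triples:
    only samples with separation below the soft margin rho are penalised."""
    return sum(max(0.0, rho - separation(*s)) for s in samples) / len(samples)

data = [(-1200.0, -1260.0, 100),   # well separated: no penalty
        (-800.0, -795.0, 80)]      # violates the margin: penalised
print(sme_loss(data))
```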
Citations: 5
Towards the use of inferred cognitive states in language modeling
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373290
Nigel G. Ward, Alejandro Vega
In spoken dialog, speakers are simultaneously engaged in various mental processes, and it seems likely that the word that will be said next depends, to some extent, on the states of these mental processes. Further, these states can be inferred, to some extent, from properties of the speaker's voice as they change from moment to moment. As an illustration of how to apply these ideas in language modeling, we examine volume and speaking rate as predictors of the upcoming word. Combining the information these provide with a trigram model gave a 2.6% improvement in perplexity.
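A minimal sketch of one way such prosodic predictors could be combined with a trigram, by linear interpolation of the two word probabilities; the interpolation weight is an illustrative assumption, not the paper's exact combination scheme.

```python
import math

def combined_logprob(p_trigram, p_prosody, lam=0.8):
    """Interpolate the trigram probability with a probability conditioned on
    prosodic features (e.g. volume and speaking rate) of the current moment."""
    return math.log(lam * p_trigram + (1.0 - lam) * p_prosody)

# e.g. a quiet, slowed-down stretch of speech raising P(next word is a filler):
print(combined_logprob(p_trigram=0.01, p_prosody=0.05))
```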
Citations: 8
Three-layer optimizations for fast GMM computations on GPU-like parallel processors
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373410
Kshitij Gupta, John Douglas Owens
In this paper we focus on optimizing compute- and memory-bandwidth-intensive GMM computations for low-end, small-form-factor devices running on GPU-like parallel processors. With special emphasis on tackling the memory bandwidth issue, which is exacerbated by the lack of CPU-like caches providing temporal locality on GPU-like parallel processors, we propose modifications to three well-known GMM computation reduction techniques. We find considerable locality at the frame, CI-GMM, and mixture layers of the GMM computation, and show how it can be extracted by following a chunk-based technique of processing multiple frames for every load of a GMM. On a 1,000-word, command-and-control, continuous-speech task, we achieve compute and memory bandwidth savings of over 60% and 90%, respectively, with some degradation in accuracy, compared to existing GPU-based fast GMM computation techniques.
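A hedged sketch of the chunk-based idea, with NumPy standing in for the GPU kernels: each load of a GMM's parameters is amortised over a chunk of frames, so means, variances, and weights are read once per chunk rather than once per frame. Sizes are toy, and the Gaussian normalisation constant is omitted for brevity.

```python
import numpy as np

def gmm_loglik_chunked(frames, means, inv_vars, log_w, chunk=8):
    """frames: (T, D); means/inv_vars: (M, D); log_w: (M,).
    Returns per-frame log-likelihoods (up to the omitted Gaussian constant)."""
    T = frames.shape[0]
    out = np.empty(T)
    for t0 in range(0, T, chunk):        # one GMM load serves `chunk` frames
        x = frames[t0:t0 + chunk]        # (<=chunk, D)
        d = x[:, None, :] - means[None]  # (chunk, M, D) frame-mixture diffs
        ll = log_w - 0.5 * np.einsum("tmd,md->tm", d * d, inv_vars)
        out[t0:t0 + chunk] = np.logaddexp.reduce(ll, axis=1)
    return out

rng = np.random.default_rng(0)
T, M, D = 32, 4, 13
print(gmm_loglik_chunked(rng.normal(size=(T, D)), rng.normal(size=(M, D)),
                         np.ones((M, D)), np.log(np.full(M, 1.0 / M)))[:3])
```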
Citations: 21