
Latest publications: 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)

Fast audio search using vector space modelling
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430187
Brett Matthews, U. Chaudhari, B. Ramabhadran
Many techniques for retrieving arbitrary content from audio have been developed to address the important challenge of providing fast access to very large volumes of multimedia data. We present a two-stage method for fast audio search, where a vector-space modelling approach is first used to retrieve a short list of candidate audio segments for a query. The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words. We explore various system configurations and examine trade-offs between speed and accuracy. We evaluate our audio search system according to the NIST 2006 Spoken Term Detection evaluation initiative. We find that we can obtain a 30-times speedup for the search phase of our system with a 10% relative loss in accuracy.
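The first stage of the two-stage design can be pictured with a minimal sketch, assuming the indexed segments are represented as bags of phone tokens scored by TF-IDF cosine similarity (the representation and function names here are illustrative, not the authors' exact formulation):

```python
# Sketch of a first-stage vector-space shortlist: each indexed audio segment
# is a bag of phone tokens, scored against the query by TF-IDF cosine
# similarity; only the top-k shortlist reaches the slower stage-2 indexes.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def shortlist(query_vec, seg_vecs, k=10):
    """Stage 1: indices of the k segments most similar to the query."""
    scores = sorted(enumerate(cosine(query_vec, v) for v in seg_vecs),
                    key=lambda p: p[1], reverse=True)
    return [i for i, _ in scores[:k]]
```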
Citations: 9
Submodularity and adaptation
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430118
J. Bilmes
Summary form only given. Convexity is a property of real-valued functions that enables their efficient optimization. Convex optimization, moreover, is a problem onto which an amazing variety of practical problems can be cast. Having strong analogs to convexity, submodularity is a property of functions on discrete sets that allows their optimization to be done in only polynomial time. Submodularity generalizes the common notion of diminishing returns. Like convexity, a large variety of discrete optimization problems can be cast in terms of submodular optimization. The first part of this talk will survey recent work in our lab on the application of submodularity to machine learning, including discriminative structure learning and word clustering for language models. The second part of the talk will discuss recent work on a technique that for many years has been widely successful in speech recognition, namely adaptation. We will view adaptation in a setting where the training and testing time distributions are not assumed identical (unlike typical Bayes risk theory). We will derive generalization error and sample complexity bounds for adaptation, specified in terms of a natural divergence between the train/test distributions. These bounds, moreover, lead to practical and effective adaptation strategies for both generative models (e.g., GMMs, HMMs) and discriminative models (e.g., MLPs, SVMs). Joint work with Mukund Narasimhan and Xiao Li.
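To make the diminishing-returns notion concrete, here is a small self-contained check (not taken from the talk) that a set-coverage function satisfies the submodular inequality f(A + {x}) - f(A) >= f(B + {x}) - f(B) whenever A is a subset of B:

```python
# Illustrative only: coverage functions f(S) = |union of sets chosen in S|
# exhibit the diminishing-returns property that defines submodularity.
coverage_sets = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
}

def f(S):
    """Number of distinct elements covered by the sets chosen in S."""
    return len(set().union(*(coverage_sets[i] for i in S)) if S else set())

A = {"a"}                      # A is a subset of B
B = {"a", "b"}
gain_A = f(A | {"c"}) - f(A)   # marginal gain of "c" given the smaller set
gain_B = f(B | {"c"}) - f(B)   # marginal gain of "c" given the larger set
assert gain_A >= gain_B        # 3 >= 2: returns diminish as the set grows
```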
Citations: 0
Non-native pronunciation variation modeling using an indirect data driven method
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430114
Mina Kim, Y. Oh, H. Kim
In this paper, we propose a pronunciation variation modeling method for improving the performance of a non-native automatic speech recognition (ASR) system that does not degrade the performance of a native ASR system. The proposed method is based on an indirect data-driven approach, where pronunciation variability is investigated from the training speech data, and variant rules are subsequently derived and applied to compensate for variability in the ASR pronunciation dictionary. To this end, native utterances are first recognized by using a phoneme recognizer, and then the variant phoneme patterns of native speech are obtained by aligning the recognized and reference phonetic sequences. The reference sequences are transcribed by using each of canonical, knowledge-based, and hand-labeled methods. As with native speech, the variant phoneme patterns of non-native speech can be obtained by recognizing non-native utterances and comparing the recognized phoneme sequences with reference phonetic transcriptions. Finally, variant rules are derived from native and non-native variant phoneme patterns using decision trees and applied to the adaptation of a dictionary for non-native and native ASR systems. In this paper, Korean spoken by native Chinese speakers is considered the non-native speech. It is shown from non-native ASR experiments that an ASR system using the dictionary constructed by the proposed pronunciation variation modeling method can relatively reduce the average word error rate (WER) by 18.5% when compared to the baseline ASR system using a canonically transcribed dictionary. In addition, the WER of a native ASR system using the proposed dictionary is also relatively reduced by 1.1%, as compared to the baseline native ASR system with a canonically constructed dictionary.
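The alignment step described above can be sketched as follows, assuming a standard edit-distance-style alignment (here via Python's difflib; the alignment tool and the example phones are illustrative, not the paper's exact procedure):

```python
# Illustrative sketch of deriving variant phoneme patterns by aligning a
# recognized phone sequence against its reference transcription; mismatched
# spans become candidate "reference -> realized" variation patterns.
from difflib import SequenceMatcher
from collections import Counter

def variant_patterns(reference, recognized):
    """Return a Counter of (reference span, realized span) mismatches."""
    patterns = Counter()
    sm = SequenceMatcher(a=reference, b=recognized, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # substitution, insertion, or deletion
            patterns[(tuple(reference[i1:i2]), tuple(recognized[j1:j2]))] += 1
    return patterns

# made-up example: a reference /l/ realized as [r]
ref = ["s", "o", "l", "i"]
hyp = ["s", "o", "r", "i"]
print(variant_patterns(ref, hyp))  # Counter({(('l',), ('r',)): 1})
```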
Citations: 25
Towards bottom-up continuous phone recognition
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430174
S. Siniscalchi, T. Svendsen, Chin-Hui Lee
We present a novel approach to designing bottom-up automatic speech recognition (ASR) systems. The key component of the proposed approach is a bank of articulatory attribute detectors implemented using a set of feed-forward artificial neural networks (ANNs). Each detector computes a score describing the activation level of the specified speech attributes that the current frame exhibits. These cues are first combined by an event merger that provides some evidence about the presence of a higher-level feature, which is then verified by an evidence verifier to produce hypotheses at the phone or word level. We evaluate several configurations of our proposed system on a continuous phone recognition task. Experimental results on the TIMIT database show that the system achieves a phone error rate of 25%, which is superior to results obtained with either hidden Markov model (HMM) or conditional random field (CRF) based recognizers. We believe the system's inherent flexibility and the ease of adding new detectors may provide further improvements.
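A structural sketch of the detector-and-merger pipeline follows, with random placeholder weights standing in for the trained ANNs (all sizes and names are illustrative assumptions):

```python
# Structural sketch of the bottom-up pipeline: a bank of per-attribute
# feed-forward detectors followed by an event merger. Weights here are
# random placeholders; in the paper each network is trained.
import numpy as np

rng = np.random.default_rng(0)
N_MFCC, N_ATTR, N_PHONE = 39, 8, 48   # illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one tiny feed-forward detector per articulatory attribute
detectors = [(rng.standard_normal((N_MFCC, 16)), rng.standard_normal((16, 1)))
             for _ in range(N_ATTR)]
# merger: attribute score vector -> phone-level evidence
W_merge = rng.standard_normal((N_ATTR, N_PHONE))

def attribute_scores(frame):
    """Activation level of each speech attribute for one acoustic frame."""
    return np.array([sigmoid(sigmoid(frame @ W1) @ w2).item()
                     for W1, w2 in detectors])

def phone_evidence(frame):
    """Merge attribute scores into a distribution over phone hypotheses."""
    z = attribute_scores(frame) @ W_merge
    return np.exp(z) / np.exp(z).sum()   # softmax

frame = rng.standard_normal(N_MFCC)      # stand-in for one MFCC frame
print(phone_evidence(frame).argmax())    # most supported phone hypothesis
```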
Citations: 41
HMM training based on CV-EM and CV Gaussian mixture optimization
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430131
T. Shinozaki, Tatsuya Kawahara
A combination of the cross-validation EM (CV-EM) algorithm and the cross-validation (CV) Gaussian mixture optimization method is explored. CV-EM and CV Gaussian mixture optimization are our previously proposed training algorithms that use CV likelihood instead of the conventional training set likelihood for robust model estimation. Since CV-EM is a parameter optimization method and CV Gaussian mixture optimization is a structure optimization algorithm, these methods can be combined. Large vocabulary speech recognition experiments are performed on oral presentations. It is shown that both CV-EM and CV Gaussian mixture optimization give lower word error rates than the conventional EM, and their combination is effective to further reduce the word error rate.
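The CV-likelihood criterion at the heart of both methods can be sketched as follows, here selecting a Gaussian mixture size with scikit-learn for brevity (an assumption for illustration; the paper interleaves cross-validation with EM on HMM state distributions rather than running separate per-fold fits):

```python
# Sketch of the CV-likelihood criterion: choose the mixture size that
# maximizes held-out (cross-validation) log-likelihood rather than
# training-set likelihood, which grows monotonically with model size.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_log_likelihood(X, n_components, n_folds=5, seed=0):
    """Average per-sample held-out log-likelihood over k folds."""
    scores = []
    for train_idx, test_idx in KFold(n_folds, shuffle=True,
                                     random_state=seed).split(X):
        gmm = GaussianMixture(n_components, random_state=seed)
        gmm.fit(X[train_idx])
        scores.append(gmm.score(X[test_idx]))  # mean log p(x), held-out fold
    return float(np.mean(scores))

X = np.random.default_rng(0).standard_normal((500, 13))  # toy "features"
best = max(range(1, 9), key=lambda m: cv_log_likelihood(X, m))
print("CV-selected mixture size:", best)
```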
Citations: 3
Speech-translation: from domain-limited to domain-unlimited translation tasks
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430141
S. Vogel
Summary form only given. In this paper we will review some of the recent work done in both domain-limited and domain-unlimited speech translation. We will show where progress has been made and highlight areas where initial expectations have not yet been met.
Citations: 0
Mixture Gaussian HMM-trajectory method using likelihood compensation
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430127
Yasuhiro Minami
We propose a new speech recognition method (the HMM-trajectory method) that generates a speech trajectory from HMMs by maximizing their likelihood while accounting for the relationship between the MFCCs and dynamic MFCCs. One major advantage of this method is that this relationship, ignored in conventional speech recognition, is directly used in the speech recognition phase. This paper improves the recognition performance of the HMM-trajectory method for dealing with mixture Gaussian distributions. While the HMM-trajectory method chooses the Gaussian distribution sequence of the HMM states by selecting the best Gaussian distribution in each state during Viterbi decoding and computing the HMM trajectory likelihood along that sequence, the proposed method compensates for the HMM trajectory likelihood using the ordinary HMM likelihood. In speaker-independent speech recognition experiments, the proposed method reduced the error rate by about 10% compared with HMMs, proving its effectiveness for Gaussian mixture components.
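Trajectory generation of this kind admits a well-known closed form in the style of HMM-based parameter generation: stack static and delta features as o = Wc and maximize the Gaussian likelihood over the static trajectory c, giving c* = (W'S^-1 W)^-1 W'S^-1 mu. The 1-D numpy sketch below illustrates this; the delta window and all values are illustrative, not the paper's exact setup:

```python
# Sketch of likelihood-maximizing trajectory generation: with o = W c
# stacking static and delta features, the Gaussian log-likelihood is
# maximized by c* = (W' S^-1 W)^-1 W' S^-1 mu  (S = diagonal covariance).
import numpy as np

T = 5                                  # frames (1-D feature for clarity)
mu = np.array([0., 1., 2., 2., 1.,     # per-frame static means...
               .5, .5, .5, 0., -.5])   # ...then per-frame delta means
var = np.ones(2 * T)                   # diagonal (co)variances

# W maps the static trajectory c (length T) to [statics; deltas] (length 2T)
W = np.zeros((2 * T, T))
W[:T] = np.eye(T)                      # identity for the static part
for t in range(T):                     # delta(t) = (c[t+1] - c[t-1]) / 2
    W[T + t, max(t - 1, 0)] -= 0.5
    W[T + t, min(t + 1, T - 1)] += 0.5

Sinv = np.diag(1.0 / var)
c = np.linalg.solve(W.T @ Sinv @ W, W.T @ Sinv @ mu)
print(c)   # smooth static trajectory consistent with the delta constraints
```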
Citations: 1
Uncertainty in training large vocabulary speech recognizers
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430160
A. Subramanya, C. Bartels, J. Bilmes, Patrick Nguyen
We propose a technique for annotating data used to train a speech recognizer. The proposed scheme is based on labeling only a single frame for every word in the training set. We make use of the virtual evidence (VE) framework within a graphical model to take advantage of such data. We apply this approach to a large vocabulary speech recognition task, and show that our VE-based training scheme can improve over the performance of a system trained using sequence-labeled data by 2.8% and 2.1% on the dev01 and eval01 sets, respectively. Annotating data in the proposed scheme is not significantly slower than sequence labeling. We present timing results showing that training using the proposed approach is about 10 times faster than training using sequence-labeled data while using only about 75% of the memory.
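One way to picture the virtual-evidence mechanism is a simplified HMM forward pass (an illustration only, not the paper's full graphical model): at an annotated frame, state likelihoods are multiplied by a soft potential, while unlabeled frames remain unconstrained:

```python
# Simplified illustration of virtual evidence in an HMM forward pass:
# a labeled frame contributes a soft potential over states instead of a
# hard assignment; unlabeled frames use an uninformative (zero) potential.
import numpy as np

def forward_with_ve(log_obs, log_trans, log_init, ve_potentials):
    """log_obs: (T, S) frame log-likelihoods; ve_potentials: (T, S) log
    virtual-evidence terms (zeros where no annotation exists)."""
    T, S = log_obs.shape
    alpha = log_init + log_obs[0] + ve_potentials[0]
    for t in range(1, T):
        # logsumexp over predecessor states, then fold in obs + evidence
        m = alpha.max()
        alpha = (m + np.log(np.exp(alpha - m) @ np.exp(log_trans))
                 + log_obs[t] + ve_potentials[t])
    return alpha  # final log forward scores per state

T, S = 6, 3
rng = np.random.default_rng(1)
log_obs = rng.standard_normal((T, S))
log_trans = np.log(np.full((S, S), 1.0 / S))   # toy uniform transitions
log_init = np.log(np.full(S, 1.0 / S))

ve = np.zeros((T, S))
ve[2] = np.log([0.8, 0.1, 0.1])  # frame 2 annotated: soft vote for state 0
print(forward_with_ve(log_obs, log_trans, log_init, ve))
```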
Citations: 8
State-dependent mixture tying with variable codebook size for accented speech recognition
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430128
L. Yi, Zheng Fang, He Lei, Xia Yunqing
In this paper, we propose a state-dependent tied mixture (SDTM) model with variable codebook size to improve model robustness to accented phonetic variations while maintaining model discriminative ability. State tying and mixture tying are combined to generate SDTM models. Compared to a pure mixture tying system, the SDTM model uses state tying to preserve the state identity; compared to a pure state tying system, such a model uses a small set of parameters, discarding the overlapping mixture distributions, for robust model estimation. The codebook size of the SDTM model is varied according to the confusion probability of states: the more confusable a state is, the larger its codebook, giving a higher degree of model resolution. The codebook size is governed by the state-level variation probability of accented phonetic confusions, which can be automatically extracted by frame-to-state alignment based on the local model mismatch. The effectiveness of this approach is evaluated on accented Mandarin speech. Our method yields significant absolute word error rate reductions of 2.1%, 9.5%, and 3.5% compared with state tying, mixture tying, and state-based phonetic tied mixtures, respectively.
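A small sketch of the allocation idea, assuming a simple linear mapping from confusion probability to mixture count (the mapping, bounds, and values are illustrative assumptions; the paper derives confusion probabilities from frame-to-state alignment):

```python
# Illustrative codebook-size allocation: states with higher confusion
# probability get more mixture components (higher model resolution),
# bounded by a floor and a ceiling. The linear mapping is an assumption.
def codebook_size(confusion_prob, min_size=4, max_size=32):
    """Map a state's confusion probability in [0, 1] to a mixture count."""
    assert 0.0 <= confusion_prob <= 1.0
    return round(min_size + confusion_prob * (max_size - min_size))

state_confusion = {"a1": 0.05, "ih2": 0.60, "n3": 0.90}  # made-up values
sizes = {s: codebook_size(p) for s, p in state_confusion.items()}
print(sizes)   # {'a1': 5, 'ih2': 21, 'n3': 29}
```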
Citations: 1
Generalized linear interpolation of language models
Pub Date: 2007-12-01 DOI: 10.1109/ASRU.2007.4430098
B. Hsu
Despite the prevalent use of model combination techniques to improve speech recognition performance on domains with limited data, little prior research has focused on the choice of the actual interpolation model. For merging language models, the most popular approach has been the simple linear interpolation. In this work, we propose a generalization of linear interpolation that computes context-dependent mixture weights from arbitrary features. Results on a lecture transcription task yield up to a 1.0% absolute improvement in recognition word error rate (WER).
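The generalization can be sketched as follows, assuming mixture weights computed by a softmax over history features (the feature function, parameterization, and model stubs are illustrative, not the paper's exact design): p(w | h) = sum_i lambda_i(h) p_i(w | h), with lambda(h) = softmax(A f(h)).

```python
# Sketch of context-dependent interpolation: mixture weights are a softmax
# over features of the history h, rather than global constants.
import numpy as np

def interpolated_prob(word, history, component_models, feat_fn, A):
    """component_models: list of functions p_i(word, history) -> prob;
    feat_fn: history -> feature vector; A: (n_models, n_feats) weights."""
    z = A @ feat_fn(history)
    lam = np.exp(z - z.max())
    lam /= lam.sum()                       # softmax -> context-dep. weights
    return float(sum(l * p(word, history)
                     for l, p in zip(lam, component_models)))

# toy setup: two "models" and a 2-d history feature (both made up)
p_in  = lambda w, h: 0.10                  # in-domain model stub
p_out = lambda w, h: 0.02                  # out-of-domain model stub
feat  = lambda h: np.array([1.0, float(len(h))])   # bias + history length
A = np.array([[0.5, 0.2], [-0.5, -0.2]])
print(interpolated_prob("word", ("the", "next"), [p_in, p_out], feat, A))
```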
Citations: 56