
Latest publications: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Crowd-sourcing for difficult transcription of speech
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163988
J. Williams, I. D. Melamed, Tirso Alonso, B. Hollister, J. Wilpon
Crowd-sourcing is a promising method for fast and cheap transcription of large volumes of speech data. However, this method cannot achieve the accuracy of expert transcribers on speech that is difficult to transcribe. Faced with such speech data, we developed three new methods of crowd-sourcing, which allow explicit trade-offs among precision, recall, and cost. The methods are: incremental redundancy, treating ASR as a transcriber, and using a regression model to predict transcription reliability. Even though the accuracy of individual crowd-workers is only 55% on our data, our best method achieves 90% accuracy on 93% of the utterances, using only 1.3 crowd-worker transcriptions per utterance on average. When forced to transcribe all utterances, our best method matches the accuracy of previous crowd-sourcing methods using only one third as many transcriptions. We also study the effects of various task design factors on transcription latency and accuracy, some of which have not been reported before.
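The incremental-redundancy idea can be sketched as follows: request one crowd-worker transcription at a time and stop as soon as two workers agree, so easy utterances cost close to one transcription. This is a minimal illustration under assumed thresholds (the function name, the agree-twice stopping rule, and the worker budget are ours, not the paper's):

```python
from collections import Counter

def incremental_redundancy(get_transcription, max_workers=5):
    """Request transcriptions one at a time; stop once two workers agree.

    get_transcription: callable returning one crowd-worker transcription.
    Returns (transcription, n_requested), with transcription None if no
    two workers agreed within the budget.
    """
    counts = Counter()
    for n in range(1, max_workers + 1):
        t = get_transcription()
        counts[t] += 1
        if counts[t] >= 2:       # two independent workers agree: accept
            return t, n
    return None, max_workers     # low confidence: reject or escalate
```

On difficult utterances the loop exhausts its budget and returns `None`, which corresponds to rejecting the utterance (trading recall for precision) or escalating it to an expert.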
Citations: 34
Maximum kurtosis beamforming with a subspace filter for distant speech recognition
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163927
K. Kumatani, J. McDonough, B. Raj
This paper presents a new beamforming method for distant speech recognition (DSR). The dominant mode subspace is considered in order to efficiently estimate the active weight vectors for maximum kurtosis (MK) beamforming with the generalized sidelobe canceler (GSC). We demonstrated in [1], [2], [3] that the beamforming method based on the maximum kurtosis criterion can remove reverberation and noise effects without the signal cancellation encountered in conventional beamforming algorithms. The MK beamforming algorithm, however, required a relatively large amount of data to reliably estimate the active weight vector because it relies on a numerical optimization algorithm. In order to achieve efficient estimation, we propose to cascade the subspace (eigenspace) filter [4, §6.8] with the active weight vector. The subspace filter decomposes the output of the blocking matrix into directional signals and ambient noise components. The ambient noise components are then averaged and subtracted from the beamformer's output, which leads to reliable estimation as well as a significant reduction in computation. We show the effectiveness of our method through a set of distant speech recognition experiments on real microphone array data captured in a real environment. Our new beamforming algorithm provided the best recognition performance among conventional beamforming techniques, a word error rate (WER) of 5.3%, which is comparable to the WER of 4.2% obtained with a close-talking microphone. Moreover, it achieved better recognition performance with a smaller amount of adaptation data than the conventional MK beamformer.
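Kurtosis is the statistic the MK criterion drives up: clean speech is super-Gaussian (heavy-tailed), so its kurtosis exceeds that of reverberant, noise-smeared mixtures, which tend toward Gaussian. A minimal sketch of the empirical excess kurtosis that such an objective would evaluate on the beamformer output (not the authors' optimization code):

```python
def excess_kurtosis(x):
    """Empirical excess kurtosis: E[(x - mu)^4] / sigma^4 - 3.

    Zero for a Gaussian; positive for super-Gaussian, speech-like
    signals, which is what a maximum-kurtosis objective rewards.
    """
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / var ** 2 - 3.0
```

A mostly-silent signal with a single spike scores strongly positive, while smeared, noise-like data scores near zero; maximizing this statistic over the active weight vector is what pushes the beamformer toward dereverberated speech.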
Citations: 9
N-Best rescoring by AdaBoost phoneme classifiers for isolated word recognition
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163910
Hiroshi Fujimura, Masanobu Nakamura, Yusuke Shinohara, T. Masuko
This paper proposes a novel technique to exploit generative and discriminative models for speech recognition. Speech recognition using discriminative models has attracted much attention in the past decade. In particular, a rescoring framework using discriminative word classifiers with generative-model-based features was shown to be effective in small-vocabulary tasks. However, a straightforward application of the framework to large-vocabulary tasks is difficult because the number of classifiers increases in proportion to the number of word pairs. We extend this framework to exploit generative and discriminative models in large-vocabulary tasks. N-best hypotheses obtained in the first pass are rescored using AdaBoost phoneme classifiers, where generative-model-based features, i.e. difference-of-likelihood features in particular, are used for the classifiers. Special care is taken to use context-dependent hidden Markov models (CDHMMs) as generative models, since most of the state-of-the-art speech recognizers use CDHMMs. Experimental results show that the proposed method reduces word errors by 32.68% relatively in a one-million-vocabulary isolated word recognition task.
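The rescoring step itself is generic: interpolate each first-pass hypothesis score with a second-pass score and re-rank. A minimal sketch, where `classifier_score` stands in for the AdaBoost phoneme classifiers and the interpolation weight is a hypothetical tuning parameter:

```python
def rescore_nbest(hypotheses, classifier_score, weight=0.5):
    """Re-rank first-pass N-best hypotheses with a secondary score.

    hypotheses: list of (text, first_pass_score) pairs, higher = better.
    classifier_score: callable text -> score from the second-pass model
    (standing in here for the AdaBoost phoneme classifiers).
    weight: interpolation weight for the secondary score (hypothetical).
    """
    return max(
        hypotheses,
        key=lambda h: (1 - weight) * h[1] + weight * classifier_score(h[0]),
    )[0]
```

With a strong enough second-pass score, a hypothesis ranked second in the first pass can overtake the first-pass winner, which is the point of the rescoring framework.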
Citations: 2
Evaluating prosodic features for automated scoring of non-native read speech
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163975
K. Zechner, Xiaoming Xi, L. Chen
We evaluate two types of prosodic features utilizing automatically generated stress and tone labels for non-native read speech in terms of their applicability for automated speech scoring. Neither type of feature has been used in the context of automated scoring of non-native read speech to date.
Citations: 12
Study of probabilistic and Bottle-Neck features in multilingual environment
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163958
F. Grézl, M. Karafiát, M. Janda
This study focuses on the performance of Probabilistic and Bottle-Neck features on a different language than the one they were trained for. It is shown that such porting is possible and that the features remain competitive with PLP features. Further, several combination techniques are evaluated. The performance of the combined features is close to that of the best performing system. Finally, bigger NNs were trained on large data from a different domain. The resulting features outperformed previously trained systems, and combining with them further improved system performance.
Citations: 74
Efficient determinization of tagged word lattices using categorial and lexicographic semirings
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163945
Izhak Shafran, R. Sproat, M. Yarmohammadi, Brian Roark
Speech and language processing systems routinely face the need to apply finite state operations (e.g., POS tagging) on results from intermediate stages (e.g., ASR output) that are naturally represented in a compact lattice form. Currently, such needs are met by converting the lattices into linear sequences (n-best scoring sequences) before and after applying the finite state operations. In this paper, we eliminate the need for this unnecessary conversion by addressing the problem of picking only the single-best scoring output labels for every input sequence. For this purpose, we define a categorial semiring that allows determinization over strings and incorporate it into a 〈Tropical, Categorial〉 lexicographic semiring. Through examples and empirical evaluations we show how determinization in this lexicographic semiring produces the desired output. The proposed solution is general in nature and can be applied to multi-tape weighted transducers that arise in many applications.
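A toy version of the lexicographic pairing helps see the mechanism: each arc weight is a (tropical cost, label string) pair, ⊕ keeps the pair with the lower cost, and ⊗ adds costs and concatenates labels. The real categorial semiring is more subtle (it supports cancellation so determinization stays well-defined); the plain string concatenation below is a simplified stand-in:

```python
# Pair a tropical weight (min, +) with a label string.
ZERO = (float("inf"), None)  # semiring zero: no path
ONE = (0.0, "")              # semiring one: empty path

def splus(a, b):
    """Semiring sum: keep the pair with the smaller tropical cost."""
    return a if a[0] <= b[0] else b

def stimes(a, b):
    """Semiring product: add costs, concatenate label strings."""
    if a == ZERO or b == ZERO:
        return ZERO
    return (a[0] + b[0], a[1] + b[1])
```

Determinizing in such a semiring keeps, for each input sequence, only the label string of the cheapest path, which is exactly the single-best tagging the paper wants to read off the lattice.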
Citations: 10
Applying Multiclass Bandit algorithms to call-type classification
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163970
L. Ralaivola, Benoit Favre, Pierre Gotab, Frédéric Béchet, Géraldine Damnati
We analyze the problem of call-type classification using data that is weakly labelled. The training data is not systematically annotated, but we consider we have a weak or lazy oracle able to answer the question “Is sample x of class q?” by a simple ‘yes’ or ‘no’ answer. This situation of learning might be encountered in many real-world problems where the cost of labelling data is very high. We prove that it is possible to learn linear classifiers in this setting, by estimating adequate expectations inspired by the Multiclass Bandit paradigm. We propose a learning strategy that builds on Kessler's construction to learn multiclass perceptrons. We test our learning procedure against two real-world datasets from spoken language understanding and provide compelling results.
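The oracle setting can be illustrated with a greedy multiclass perceptron that only ever sees the yes/no answer to "is x of class q?" for its own guess q. This is a simplified sketch (purely greedy, without the exploration a full bandit algorithm would add, and not the authors' Kessler-construction learner):

```python
def bandit_perceptron(data, n_classes, dim, epochs=10):
    """Train one weight vector per class from yes/no oracle feedback.

    data: list of (x, y) pairs with x a feature list and y the true
    class; only the binary answer to "is x of class q?", where q is
    the current guess, drives the update -- the full label is never
    revealed to the learner.
    """
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in data:
            scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
            q = max(range(n_classes), key=scores.__getitem__)
            sign = 1.0 if q == y else -1.0   # oracle answers yes / no
            for i in range(dim):             # reinforce or penalize q only
                W[q][i] += sign * x[i]
    return W
```

Note that when the guess is wrong, only the guessed class is penalized; the true class's weights are left untouched because the oracle never says which class was right.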
Citations: 6
Discriminative splitting of Gaussian/log-linear mixture HMMs for speech recognition
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163896
Muhammad Ali Tahir, R. Schlüter, H. Ney
This paper presents a method to incorporate mixture density splitting into discriminative log-linear training of the acoustic model. The standard method is to obtain a high resolution model by maximum likelihood training and density splitting, and then to train this model further discriminatively. For a single Gaussian density per state, the log-linear MMI optimization is a global maximum problem, and by further splitting and discriminative training of this model we can obtain a higher complexity model. The mixture training is not a global maximum problem; nevertheless, experimentally we achieve large gains in the objective function and corresponding moderate gains in word error rate on a large vocabulary corpus.
Citations: 5
Robust seed model training for speaker adaptation using pseudo-speaker features generated by inverse CMLLR transformation
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163925
Arata Itoh, Sunao Hara, N. Kitaoka, K. Takeda
In this paper, we propose a novel acoustic model training method which is suitable for speaker adaptation in speech recognition. Our method is based on feature generation from a small amount of speakers' data. For decades, speaker adaptation methods have been widely used. Such adaptation methods need some amount of adaptation data, and if the data is not sufficient, speech recognition performance degrades significantly. If the seed models to be adapted to a specific speaker can widely cover more speakers, speaker adaptation can perform robustly. To make such robust seed models, we adopt inverse maximum likelihood linear regression (MLLR) transformation-based feature generation, and then train our seed models using these features. First we obtain MLLR transformation matrices from a limited number of existing speakers. Then we extract the bases of the MLLR transformation matrices using PCA. The distribution of the weight parameters expressing the MLLR transformation matrices for the existing speakers is estimated. Next we generate pseudo-speaker MLLR transformations by sampling the weight parameters from the distribution, and apply the inverse of each transformation to the normalized existing speaker features to generate the pseudo-speakers' features. Finally, using these features, we train the acoustic seed models. Using these seed models, we obtained better speaker adaptation results than with simply environmentally adapted models.
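The pseudo-speaker sampling step can be sketched in isolation: given each existing speaker's weight vector in the PCA basis of MLLR transforms, fit a per-dimension Gaussian and draw new weight vectors from it. Function and variable names here are illustrative, not from the paper:

```python
import random

def sample_pseudo_weights(speaker_weights, n_samples=3, seed=0):
    """Draw pseudo-speaker weight vectors from per-dimension Gaussians.

    speaker_weights: one row per existing speaker, giving that
    speaker's coordinates in the PCA basis of MLLR transforms.
    Each sampled vector, expanded back through the PCA bases, would
    define one pseudo-speaker transformation.
    """
    rng = random.Random(seed)
    stats = []
    for d in zip(*speaker_weights):          # iterate over dimensions
        mu = sum(d) / len(d)
        sigma = (sum((v - mu) ** 2 for v in d) / len(d)) ** 0.5
        stats.append((mu, sigma))
    return [[rng.gauss(mu, sigma) for mu, sigma in stats]
            for _ in range(n_samples)]
```

A diagonal Gaussian over the weights is the simplest assumption consistent with the abstract; the sampled weights stay in the region the real speakers occupy, which is what makes the generated pseudo-speakers plausible.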
Citations: 1
Factor analysis based session variability compensation for Automatic Speech Recognition
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163920
Mickael Rouvier, M. Bouallegue, D. Matrouf, G. Linarès
In this paper we propose a new feature normalization based on Factor Analysis (FA) for the problem of acoustic variability in Automatic Speech Recognition (ASR). The FA paradigm was previously used in the field of ASR to model the useful information: the HMM state dependent acoustic information. In this paper, we propose to use the FA paradigm to model the useless information (speaker- or channel-variability) in order to remove it from acoustic data frames. The transformed training data frames are then used to train new HMM models using the standard training algorithm. The transformation is also applied to the test data before the decoding process. With this approach we obtain, on French broadcast news, an absolute WER reduction of 1.3%.
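If the FA model's nuisance directions are taken as an orthonormal basis, the compensation step reduces to subtracting each frame's projection onto that basis. A minimal sketch (the subspace `U` would come from a trained factor-analysis model of session variability; here it is just an input):

```python
def remove_nuisance(frame, U):
    """Subtract a frame's projection onto nuisance directions.

    frame: feature vector (list of floats).
    U: list of orthonormal nuisance direction vectors (assumed to
    come from a factor-analysis model of session variability).
    Returns the compensated frame x - sum_k (x . u_k) u_k.
    """
    out = list(frame)
    for u in U:
        w = sum(x * ui for x, ui in zip(frame, u))  # coordinate on u
        for i, ui in enumerate(u):
            out[i] -= w * ui
    return out
```

The same subtraction is applied to training frames before re-training the HMMs and to test frames before decoding, which matches the pipeline the abstract describes.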
Citations: 3