
Latest publications from the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding

Articulatory feature detection with Support Vector Machines for integration into ASR and phone recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373326
U. Chaudhari, M. Picheny
We study the use of Support Vector Machines (SVM) for detecting the occurrence of articulatory features in speech audio data and using the information contained in the detector outputs to improve phone and speech recognition. Our expectation is that an SVM should be able to appropriately model the separation of the classes which may have complex distributions in feature space. We show that performance improves markedly when using discriminatively trained speaker dependent parameters for the SVM inputs, and compares quite well to results in the literature using other classifiers, namely Artificial Neural Networks (ANN). Further, we show that the resulting detector outputs can be successfully integrated into a state of the art speech recognition system, with consequent performance gains. Notably, we test our system on English broadcast news data from dev04f.
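As a rough illustration of the detector setup, the sketch below trains one binary SVM per articulatory feature on frame-level feature vectors and exposes posterior-like scores that could feed an ASR front end. The feature names, dimensions, and random data are illustrative assumptions, not the paper's discriminatively trained speaker-dependent inputs.

```python
# Minimal sketch: one binary SVM detector per articulatory feature.
# Feature names, dimensions, and random data are invented for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, n_dims = 2000, 40                       # assumed frame count / dimension
X = rng.normal(size=(n_frames, n_dims))           # stand-in acoustic feature frames
features = ["voiced", "nasal", "fricative"]       # assumed articulatory features
labels = {f: rng.integers(0, 2, size=n_frames) for f in features}

detectors = {}
for f in features:
    clf = SVC(kernel="rbf", probability=True)     # RBF can model complex class boundaries
    clf.fit(X, labels[f])
    detectors[f] = clf

# Detector outputs (posteriors) could then be appended to the recognizer's features.
frame = X[:1]
scores = {f: detectors[f].predict_proba(frame)[0, 1] for f in features}
print(scores)
```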
Citations: 10
Island-driven search using broad phonetic classes
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373547
Tara N. Sainath
Most speech recognizers do not differentiate between reliable and unreliable portions of the speech signal during search. As a result, most of the search effort is concentrated in unreliable areas. Island-driven search addresses this problem by first identifying reliable islands and directing the search out from these islands towards unreliable gaps. In this paper, we develop a technique to detect islands from knowledge of hypothesized broad phonetic classes (BPCs). Using this island/gap knowledge, we explore a method to prune the search space to limit computational effort in unreliable areas. In addition, we also investigate scoring less detailed BPC models in gap regions and more detailed phonetic models in islands. Experiments on both small and large scale vocabulary tasks indicate that our island-driven search strategy results in an improvement in recognition accuracy and computation time.
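A minimal sketch of the island/gap idea, assuming per-frame broad-phonetic-class (BPC) posteriors are available: frames where the top BPC is confident are marked as reliable islands, and search effort is reduced in the remaining gaps. The posteriors, threshold, and beam widths below are invented for illustration.

```python
# Sketch of island/gap labeling from assumed BPC posteriors.
import numpy as np

rng = np.random.default_rng(1)
T, n_bpc = 100, 5                              # frames, broad phonetic classes
post = rng.dirichlet(np.ones(n_bpc), size=T)   # stand-in per-frame BPC posteriors

CONF = 0.6                                     # assumed confidence threshold
is_island = post.max(axis=1) > CONF            # confident frames become islands

def beam_for(t, island_beam=200, gap_beam=50):
    """Spend search effort in reliable islands; prune harder in unreliable gaps,
    where cheaper BPC models would be scored instead of detailed phonetic ones."""
    return island_beam if is_island[t] else gap_beam

print(int(is_island.sum()), "island frames of", T)
```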
Citations: 8
Discriminative training of n-gram language models for speech recognition via linear programming
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373248
Vladimir Magdin, Hui Jiang
This paper presents a novel discriminative training algorithm for n-gram language models for use in large vocabulary continuous speech recognition. The algorithm uses Maximum Mutual Information Estimation (MMIE) to build an objective function that involves a metric computed between correct transcriptions and their competing hypotheses, which are encoded as word graphs generated from the Viterbi decoding process. The nonlinear MMIE objective function is approximated by a linear one using an EM-style auxiliary function, thus converting the discriminative training of n-gram language models into a linear programming problem, which can be efficiently solved by many convex optimization tools. Experimental results on the SPINE1 speech recognition corpus have shown that the proposed discriminative training method can outperform conventional discounting-based maximum likelihood estimation methods. A relative reduction in word error rate of close to 3% has been observed on the SPINE1 speech recognition task.
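The toy sketch below shows the shape of such a linearized objective: choose bounded adjustments to n-gram log-probabilities that favor reference hypotheses over competitors, and hand the resulting linear program to an off-the-shelf solver. The counts, bounds, and solver are assumptions for illustration; the paper derives its linear objective from an EM-style auxiliary function over word graphs.

```python
# Toy linear program: adjust n-gram log-probabilities so references outscore
# competitors. All numbers below are invented for illustration.
import numpy as np
from scipy.optimize import linprog

# Occurrence counts of each n-gram in reference vs. competing hypotheses.
ref_counts = np.array([3.0, 1.0, 0.0, 2.0])
comp_counts = np.array([1.0, 2.0, 2.0, 0.0])

# Maximize (ref - comp) . x  ==  minimize -(ref - comp) . x
c = -(ref_counts - comp_counts)
# Keep adjustments small so the model stays close to the ML estimate.
bounds = [(-0.5, 0.5)] * len(c)
res = linprog(c, bounds=bounds, method="highs")
print("n-gram log-prob adjustments:", res.x)
```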
Citations: 5
Self-supervised discriminative training of statistical language models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373401
Puyang Xu, D. Karakos, S. Khudanpur
A novel self-supervised discriminative training method for estimating language models for automatic speech recognition (ASR) is proposed. Unlike traditional discriminative training methods that require transcribed speech, only untranscribed speech and a large text corpus are required. An exponential form is assumed for the language model, as done in maximum entropy estimation, but the model is trained from the text using a discriminative criterion that targets word confusions actually witnessed in first-pass ASR output lattices. Specifically, model parameters are estimated to maximize the likelihood ratio between words w in the text corpus and w's cohorts in the test speech, i.e. other words that w competes with in the test lattices. Empirical results are presented to demonstrate statistically significant improvements over a 4-gram language model on a large vocabulary ASR task.
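A minimal sketch of the cohort criterion follows: raise the score of a corpus word w relative to the lattice words it competes with, under a log-linear model. The cohort table, unigram features, and learning rate are illustrative assumptions, not the paper's actual confusion sets or feature design.

```python
# Sketch: gradient step on log( score(w) / (score(w) + sum of cohort scores) )
# under a log-linear model with unigram features. Cohorts are assumed.
import math
from collections import defaultdict

cohorts = {"speech": ["beach", "peach"], "model": ["modal"]}  # assumed confusions
weights = defaultdict(float)          # exponential (log-linear) model parameters

def score(word):
    return math.exp(weights[word])

def update(word, lr=0.1):
    """One ascent step on the likelihood ratio of word vs. its cohorts."""
    denom = score(word) + sum(score(c) for c in cohorts[word])
    weights[word] += lr * (1.0 - score(word) / denom)
    for c in cohorts[word]:
        weights[c] -= lr * score(c) / denom

for w in ["speech", "speech", "model"]:   # words drawn from the text corpus
    update(w)
print(dict(weights))
```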
Citations: 36
Mask estimation employing Posterior-based Representative Mean for missing-feature speech recognition with time-varying background noise
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373398
Wooil Kim, J. Hansen
This paper proposes a novel mask estimation method for missing-feature reconstruction to improve speech recognition performance in time-varying background noise conditions. Conventional mask estimation methods based on noise estimates and spectral subtraction fail to reliably estimate the mask under such conditions. The proposed method utilizes a Posterior-based Representative Mean (PRM) vector for determining the reliability of the input speech spectrum, obtained as a posterior-weighted sum of the mean parameters of the speech model. To obtain the noise-corrupted speech model, a model combination method is employed, which we proposed in a previous study on feature compensation [1]. Experimental results demonstrate that the proposed mask estimation method is considerably more effective at increasing speech recognition performance in time-varying background noise conditions. By employing the proposed PRM-based mask estimation for missing-feature reconstruction, we obtain +36.29% and +30.45% average relative improvements in WER for speech babble and background music conditions respectively, compared to conventional mask estimation methods.
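A small sketch of the PRM computation, under toy assumptions: component posteriors of a (noisy-)speech GMM weight its means into a representative vector, and channels where the observation strays too far from it are marked unreliable. The GMM, unit covariances, and threshold are invented for illustration.

```python
# Sketch: posterior-weighted representative mean and a per-channel reliability mask.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, M = 10, 4                              # spectral channels, GMM components
means = rng.normal(size=(M, D))           # stand-in noisy-speech GMM means
obs = rng.normal(size=D)                  # observed log-spectrum for one frame

# Component posteriors p(m | obs) under equal priors and unit covariance.
lik = np.array([multivariate_normal.pdf(obs, mean=mu) for mu in means])
post = lik / lik.sum()

prm = post @ means                        # posterior-weighted representative mean
mask = np.abs(obs - prm) < 1.0            # assumed reliability threshold
print("reliable channels:", mask.astype(int))
```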
Citations: 4
Local and global models for spontaneous speech segment detection and characterization
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372928
Richard Dufour, Y. Estève, P. Deléglise, Frédéric Béchet
Processing spontaneous speech is one of the many challenges that automatic speech recognition (ASR) systems have to deal with. The main evidence characterizing spontaneous speech is disfluencies (filled pauses, repetitions, repairs and false starts), and many studies have focused on the detection and correction of these disfluencies. In this study we define spontaneous speech as unprepared speech, in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents. Disfluencies are of course very good indicators of unprepared speech, but they are not the only ones: ungrammaticality and language register are also important, as are prosodic patterns. This paper proposes a set of acoustic and linguistic features that can be used for characterizing and detecting spontaneous speech segments in large audio databases. Moreover, we introduce a strategy that takes advantage of a global classification process using a probabilistic model, which significantly improves spontaneous speech detection.
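As an illustration of the segment-level classification, the sketch below turns a transcribed segment into a few disfluency-style features and scores it with a probabilistic classifier. The feature set, filler list, and training data are invented; the paper combines richer acoustic and linguistic cues with a global classification stage.

```python
# Sketch: disfluency-style segment features + a probabilistic classifier.
# Features, fillers, and training examples are toy assumptions.
from sklearn.linear_model import LogisticRegression

FILLED_PAUSES = {"uh", "um", "euh"}          # assumed filler inventory

def segment_features(words):
    n = max(len(words), 1)
    pauses = sum(w in FILLED_PAUSES for w in words) / n   # filled-pause rate
    repeats = sum(a == b for a, b in zip(words, words[1:])) / n  # repetition rate
    return [pauses, repeats, n]

train = [("uh i i mean uh yes".split(), 1),                   # spontaneous
         ("the committee approved the budget".split(), 0)]    # prepared
X = [segment_features(w) for w, _ in train]
y = [lab for _, lab in train]
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([segment_features("um well um so".split())]))
```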
Citations: 22
Syntactic features for Arabic speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373470
H. Kuo, L. Mangu, Ahmad Emami, I. Zitouni, Young-suk Lee
We report word error rate improvements with syntactic features using a neural probabilistic language model through N-best re-scoring. The syntactic features we use include exposed head words and their non-terminal labels both before and after the predicted word. Neural network LMs generalize better to unseen events by modeling words and other context features in continuous space. They are suitable for incorporating many different types of features, including syntactic features, where there is no pre-defined back-off order. We choose an N-best re-scoring framework to be able to take full advantage of the complete parse tree of the entire sentence. Using syntactic features, along with morphological features, improves the word error rate (WER) by up to 5.5% relative, from 9.4% to 8.6%, on the latest GALE evaluation test set.
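The surrounding N-best re-scoring loop is simple to show, so a skeletal sketch follows: combine each hypothesis's first-pass score with a language-model score and keep the best. The stub score function and interpolation weight are placeholders; the paper's actual scorer is a neural LM fed exposed head words and non-terminal labels from full parse trees.

```python
# Skeletal N-best re-scoring. The LM scorer and weight are placeholders.
def syntactic_lm_score(hyp):
    # Stub: a real system would score hyp with a neural LM over parse features.
    return -len(hyp.split())

nbest = [("the cat sat on mat", -120.5),       # (hypothesis, first-pass score)
         ("the cats at on mat", -119.8)]
LAMBDA = 10.0                                  # assumed interpolation weight
rescored = [(h, s + LAMBDA * syntactic_lm_score(h)) for h, s in nbest]
print(max(rescored, key=lambda x: x[1])[0])    # best hypothesis after re-scoring
```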
Citations: 30
Generalized cyclic transformations in speaker-independent speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373284
Florian Müller, Eugene Belilovsky, A. Mertins
A feature extraction method is presented that is robust against vocal tract length changes. It uses generalized cyclic transformations, primarily used within the field of pattern recognition. Under matched training and testing conditions, the resulting accuracies are comparable to those of MFCCs. However, under mismatched training and testing conditions with respect to the mean vocal tract length, the presented features significantly outperform MFCCs.
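The key invariance such transforms exploit can be demonstrated in a few lines: the magnitude of a vector's DFT is unchanged by cyclic shifts. This is only a minimal demonstration of the shift-invariance property, with a toy vector, not the paper's specific generalized cyclic transformation.

```python
# Demo: |DFT| of a vector is invariant to cyclic shifts of that vector.
import numpy as np

v = np.array([1.0, 3.0, 2.0, 0.5, 4.0, 1.5])
shifted = np.roll(v, 2)              # a cyclic shift of the same pattern

inv_v = np.abs(np.fft.fft(v))        # shift-invariant representation
inv_s = np.abs(np.fft.fft(shifted))
print(np.allclose(inv_v, inv_s))     # True: both map to the same features
```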
Citations: 6
MLLR/MAP adaptation using pronunciation variation for non-native speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373299
Y. Oh, H. Kim
In this paper, we propose an acoustic model adaptation method based on maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) adaptation using pronunciation variations for non-native speech recognition. To this end, we first obtain pronunciation variations using an indirect data-driven approach. Next, we generate two sets of regression classes: one composed of regression classes for all pronunciations and the other of classes for pronunciation variations. The former are referred to as overall regression classes and the latter as pronunciation variation regression classes. We then sequentially apply the two adaptations to non-native speech using the overall regression classes, while the acoustic models associated with the pronunciation variations are adapted using the pronunciation variation regression classes. In the final step, both sets of adapted acoustic models are merged. Thus, the resulting acoustic models can cover the characteristics of non-native speakers as well as the pronunciation variations of non-native speech. Non-native automatic speech recognition experiments on Korean spoken English continuous speech show that an ASR system employing the proposed adaptation method reduces the average word error rate by a relative 9.43% compared to a traditional MLLR/MAP adaptation method.
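As background for the MAP half of the adaptation, a toy update for a single Gaussian mean is sketched below: the adapted mean interpolates between the prior (speaker-independent) mean and the adaptation-data mean, weighted by the data count. This illustrates standard MAP mean adaptation, not the paper's full MLLR/MAP pipeline over pronunciation-variation regression classes; tau is an assumed prior weight.

```python
# Toy MAP mean update for one Gaussian, the building block of MAP adaptation.
import numpy as np

prior_mean = np.array([0.0, 0.0])                  # speaker-independent mean
adapt_frames = np.array([[1.0, 2.0],               # stand-in adaptation frames
                         [1.2, 1.8],
                         [0.8, 2.2]])
tau = 10.0                                         # assumed MAP prior weight
n = len(adapt_frames)
map_mean = (tau * prior_mean + adapt_frames.sum(axis=0)) / (tau + n)
print(map_mean)   # shrinks the data mean toward the prior when n is small
```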
Citations: 17
The ESAT 2008 system for N-Best Dutch speech recognition benchmark
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373311
Kris Demuynck, Antti Puurula, Dirk Van Compernolle, P. Wambacq
This paper describes the ESAT 2008 Broadcast News transcription system for the N-Best 2008 benchmark, developed in part to test the recent SPRAAK Speech Recognition Toolkit. The ESAT system was developed for the Southern Dutch Broadcast News subtask of N-Best using standard methods of modern speech recognition. A combination of improvements was made in commonly overlooked areas such as text normalization, pronunciation modeling, lexicon selection and morphological modeling, virtually solving the out-of-vocabulary (OOV) problem for Dutch by reducing the OOV rate to 0.06% on the N-Best development data and 0.23% on the evaluation data. Recognition experiments were run with several configurations comparing one-pass vs. two-pass decoding, high-order vs. low-order n-gram models, lexicon sizes, and different types of morphological modeling. The system achieved a 7.23% word error rate (WER) on the broadcast news development data and 20.3% on the much more difficult N-Best evaluation data.
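The OOV figures quoted above come from straightforward lexicon-coverage measurements; a tiny stand-in computation is sketched below. The lexicon and text are invented placeholders, not the system's actual resources.

```python
# Stand-in OOV-rate computation over a toy Dutch lexicon and word sequence.
lexicon = {"de", "kat", "zit", "op", "mat"}
text = "de kat zit op de grote mat".split()
oov = [w for w in text if w not in lexicon]
print(f"OOV rate: {100 * len(oov) / len(text):.2f}% ({oov})")
```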
Citations: 26