
2009 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Using temporal information for improving articulatory-acoustic feature classification
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373314
Barbara Schuppler, Joost van Doremalen, O. Scharenborg, B. Cranen, L. Boves
This paper combines acoustic features with a high temporal and a high frequency resolution to reliably classify articulatory events of short duration, such as bursts in plosives. SVM classification experiments on TIMIT and SVArticulatory showed that articulatory-acoustic features (AFs) based on a combination of MFCCs derived from a long window of 25 ms and a short window of 5 ms, both shifted in 2.5 ms steps (Both), outperform standard MFCCs derived with a window of 25 ms and a shift of 10 ms (Baseline). Finally, a comparison of the TIMIT and SVArticulatory results showed that for classifiers trained on data that allow for asynchronously changing AFs (SVArticulatory), the improvement from Baseline to Both is larger than for classifiers trained on data where AFs change simultaneously with the phone boundaries (TIMIT).
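As a rough illustration of the dual-resolution framing described above (the function names and signal lengths are invented for this sketch, and actual MFCC extraction would follow standard filterbank processing), the 2.5 ms shift yields many more analysis frames over a short burst region than the 10 ms baseline:

```python
# Sketch: frame counts for the long 25 ms and short 5 ms windows,
# both shifted in 2.5 ms steps, versus the 25 ms / 10 ms baseline.

def num_frames(signal_ms: float, window_ms: float, shift_ms: float) -> int:
    """Number of full analysis windows that fit into the signal."""
    if signal_ms < window_ms:
        return 0
    return int((signal_ms - window_ms) // shift_ms) + 1

def frame_starts(signal_ms: float, window_ms: float, shift_ms: float):
    """Start times (in ms) of each analysis window."""
    return [i * shift_ms
            for i in range(num_frames(signal_ms, window_ms, shift_ms))]
```

For a 100 ms stretch of speech, the 25 ms window at a 2.5 ms shift produces 31 frames where the baseline produces only 8, which is the finer temporal sampling the paper exploits for short articulatory events.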
Citations: 9
An improved parallel model combination method for noisy speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373332
H. Veisi, H. Sameti
In this paper, a novel method, called PC-PMC, is proposed to improve the performance of automatic speech recognition systems in noisy environments. This method is based on the parallel model combination (PMC) technique and uses the normalization ability of Cepstral Mean Subtraction (CMS) and the compression and de-correlation capabilities of Principal Component Analysis (PCA). It combines the additive-noise compensation of PMC with the convolutive-noise removal ability of CMS and PCA. The first problem to be solved in realizing PC-PMC is that the PMC algorithm requires invertible modules in the front-end of the system, while CMS normalization is not an invertible process. A framework is also required for adapting the PCA transform in the presence of noise. The method proposed in this paper provides solutions to both problems. Our evaluations are done on four different real noisy tasks using the Nevisa Persian continuous speech recognition system. Experimental results demonstrate a significant reduction in word error rate using PC-PMC in comparison with standard robustness methods.
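The CMS step that PC-PMC builds on is simple enough to sketch directly: subtract the per-utterance mean from each cepstral coefficient, which removes stationary convolutive (channel) effects. A minimal version, assuming frames are equal-length cepstral vectors:

```python
# Cepstral Mean Subtraction (CMS): normalize an utterance by removing
# the per-dimension mean of its cepstral vectors.

def cepstral_mean_subtraction(frames):
    """frames: list of equal-length cepstral vectors; returns normalized copies."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

The non-invertibility the paper points out is visible here: once the mean is subtracted and discarded, the original cepstra cannot be recovered, which is what conflicts with PMC's requirement of invertible front-end modules.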
Citations: 2
Investigations on features for log-linear acoustic models in continuous speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373362
Simon Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, H. Ney
Hidden Markov Models with Gaussian Mixture Models as emission probabilities (GHMMs) are the underlying structure of all state-of-the-art speech recognition systems. Using Gaussian mixture distributions follows the generative approach, in which the class-conditional probability is modeled, although for classification only the posterior probability is needed. Though very successful in related tasks such as Natural Language Processing (NLP), direct modeling of posterior probabilities with log-linear models has rarely been used in speech recognition and has not been applied successfully to continuous speech recognition. In this paper we report competitive results for a speech recognizer with a log-linear acoustic model on the Wall Street Journal corpus, a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We trained this model from scratch, i.e. without relying on an existing GHMM system. The use of data-dependent sparse features for log-linear models has been proposed previously. We compare them with polynomial features and show that the combination of polynomial and data-dependent sparse features leads to better results.
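The core of a log-linear model, direct posterior modeling, can be sketched in a few lines: class posteriors are a softmax over weighted feature functions. The weights and features below are toy values (plain dot products), not the paper's trained polynomial or sparse features:

```python
# Toy log-linear model: posterior p(c|x) proportional to exp(w_c . x),
# normalized over all classes (softmax).
import math

def log_linear_posteriors(weights, features):
    """weights: dict class -> weight vector; features: observation vector."""
    scores = {c: sum(w_i * x_i for w_i, x_i in zip(w, features))
              for c, w in weights.items()}
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {c: math.exp(s - z) for c, s in scores.items()}
```

Unlike a GHMM, no class-conditional density is ever modeled; the normalization term `z` is what makes the outputs a proper posterior distribution.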
Citations: 22
Kernel metric learning for phonetic classification
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373389
J. Huang, Xi Zhou, M. Hasegawa-Johnson, Thomas S. Huang
While a spoken sound is described by a handful of frame-level spectral vectors, not all frames contribute equally to either human perception or machine classification. In this paper, we introduce a novel framework to automatically emphasize important speech frames relevant to phonetic information. We jointly learn the importance of speech frames and a distance metric across the phone classes, attempting to satisfy a large margin constraint: the distance from a segment to its correct label class should be less than the distance to any other phone class by the largest possible margin. Furthermore, a universal background model structure is proposed to give the correspondence between statistical models of phone types and tokens, allowing us to use statistical models of each phone token in a large margin speech recognition framework. Experiments on the TIMIT database demonstrated the effectiveness of our framework.
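The large-margin constraint above can be made concrete with a toy checker. Plain Euclidean distance to class centroids stands in here for the learned kernel metric, and all names and values are illustrative only:

```python
# Check the large-margin constraint: d(segment, correct class) + margin
# must not exceed d(segment, any other class).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def margin_violations(segment, centroids, correct, margin=1.0):
    """Return classes whose distance violates the margin constraint."""
    d_correct = euclidean(segment, centroids[correct])
    return [c for c, mu in centroids.items()
            if c != correct and euclidean(segment, mu) < d_correct + margin]
```

Training in the paper's framework would adjust the metric (and frame weights) so that the violation list is empty for as many training segments as possible, with the largest feasible margin.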
Citations: 2
Automatic punctuation generation for speech
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373365
Wenzhu Shen, Roger Peng Yu, F. Seide, Ji Wu
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a Maximum A-Posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from Automatic Speech Recognition (ASR). The system consists of an "acoustic model" (AM) for prosodic features (actually pause duration) and a "language model" (LM) for text-only features. The LM combines three components: an MLP-based trigger-word model and a forward and a backward trigram punctuation predictor. The separation into acoustic and language model makes it possible to learn these models on different corpora, in particular allowing the LM to be trained on large amounts of text for which no acoustic information is available. We find that the trigger-word LM is very useful, and further improvement can be achieved when combining both prosodic and lexical information. We achieve an F-measure of 81.0% and 56.5% for voicemails and podcasts, respectively, on reference transcripts, and 69.6% for voicemails on ASR transcripts.
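The MAP decision itself reduces to combining the two log-scores and taking the argmax over candidate marks (including "no punctuation"). The score tables and weight below are invented for illustration, not the paper's trained models:

```python
# MAP punctuation decision: combine the prosodic (pause-duration) AM
# log-score with the text-only LM log-score and pick the best mark.

def map_punctuation(pause_log_probs, lm_log_probs, lm_weight=1.0):
    """Both args: dict mark -> log-probability; '' means no punctuation."""
    candidates = set(pause_log_probs) & set(lm_log_probs)
    return max(candidates,
               key=lambda m: pause_log_probs[m] + lm_weight * lm_log_probs[m])
```

Because the two models are combined only at this scoring step, each can be trained on its own corpus, which is exactly the separation the abstract emphasizes.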
Citations: 5
Constrained discriminative training of N-gram language models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373338
A. Rastrow, A. Sethy, B. Ramabhadran
In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language-specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of speech recognition systems; however, discriminative training for N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs: updating the back-off probability of unseen events, normalizing the N-gram updates to ensure a probability distribution, and imposing a relative-entropy based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and MIT lecture corpora. A modest improvement of 0.2% absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems.
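Two of the three constraints can be sketched directly (with made-up probability values): renormalizing the updated N-gram probabilities so they again sum to one, and measuring the relative entropy (KL divergence) against the original model so a global update budget can be enforced:

```python
# Renormalization and relative-entropy measurement for discriminatively
# updated N-gram probabilities (toy distributions, same support assumed).
import math

def renormalize(probs):
    """Rescale raw updated probabilities into a proper distribution."""
    z = sum(probs.values())
    return {k: v / z for k, v in probs.items()}

def kl_divergence(p, q):
    """Relative entropy D(p || q) over a shared discrete support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)
```

A global constraint of the kind the paper describes would cap `kl_divergence(updated, original)`, keeping the discriminatively trained LM from drifting too far from the well-estimated ML model.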
Citations: 8
Robust vocabulary independent keyword spotting with graphical models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373544
M. Wöllmer, F. Eyben, Björn Schuller, G. Rigoll
This paper introduces a novel graphical model architecture for robust and vocabulary independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is robust with respect to phoneme recognition errors. We use a hidden garbage variable together with the concept of switching parents to model keywords as well as arbitrary speech. This implies that keywords can be added to the vocabulary without having to re-train the model. The design of our model architecture is thereby optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the receiver operating characteristic curve. Experiments on the TIMIT corpus reveal that our graphical model outperforms a comparable hidden Markov model based keyword spotter that uses conventional garbage modelling.
Citations: 19
Automatic selection of recognition errors by respeaking the intended text
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373347
K. Vertanen, P. Kristensson
We investigate how to automatically align spoken corrections with an initial speech recognition result. Such automatic alignment would enable one-step voice-only correction in which users simply respeak their intended text. We present three new models for automatically aligning corrections: a 1-best model, a word confusion network model, and a revision model. The revision model allows users to alter what they intended to write even when the initial recognition was completely correct. We evaluate our models with data gathered from two user studies. We show that providing just a single correct word of context dramatically improves alignment success from 65% to 84%. We find that a majority of users provide such context without being explicitly instructed to do so. We find that the revision model is superior when users modify words in their initial recognition, improving alignment success from 73% to 83%. We show how our models can easily incorporate prior information about correction location and we show that such information aids alignment success. Last, we observe that users speak their intended text faster and with fewer re-recordings than if they are forced to speak misrecognized text.
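The flavor of the 1-best alignment idea can be captured with a small sketch: slide the respoken correction over the original 1-best hypothesis and pick the word span with the smallest edit distance. This is a toy stand-in for the paper's models, with invented function names:

```python
# Align a respoken correction against the 1-best hypothesis by finding
# the word span that minimizes word-level edit distance.

def edit_distance(a, b):
    """Word-level Levenshtein distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (wa != wb))  # substitution
    return dp[-1]

def best_span(hypothesis, correction):
    """Return (start, end) of the hypothesis word span closest to the correction."""
    words = hypothesis.split()
    spans = [(i, j) for i in range(len(words))
             for j in range(i + 1, len(words) + 1)]
    return min(spans,
               key=lambda s: edit_distance(words[s[0]:s[1]], correction.split()))
```

Given the hypothesis "the cat sat" and the respoken correction "hat sat", the sketch selects the trailing span, which is where the recognition error would be replaced.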
Citations: 21
Discriminative adaptive training with VTS and JUD
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373266
F. Flego, M. Gales
Adaptive training is a powerful approach for building speech recognition systems on non-homogeneous training data. Recently, approaches based on predictive model-based compensation schemes, such as Joint Uncertainty Decoding (JUD) and Vector Taylor Series (VTS), have been proposed. This paper reviews these model-based compensation schemes and relates them to factor-analysis style systems. Forms of Maximum Likelihood (ML) adaptive training with these approaches are described, based on both second-order optimisation schemes and Expectation Maximisation (EM). However, discriminative training is used in many state-of-the-art speech recognition systems. Hence, this paper proposes discriminative adaptive training with predictive model-compensation approaches for noise robust speech recognition. This training approach is applied to both JUD and VTS compensation with minimum phone error training. A large scale multi-environment training configuration is used and the systems are evaluated on a range of in-car collected data tasks.
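The standard additive-noise mismatch function at the heart of VTS can be sketched per filterbank channel in the log-spectral domain: a clean log-energy mu_x corrupted by noise with log-energy mu_n gives, to zeroth order, mu_y = mu_x + log(1 + exp(mu_n - mu_x)). Full cepstral-domain VTS additionally involves DCT/IDCT rotations and Jacobian terms, omitted in this sketch:

```python
# Zeroth-order VTS mean compensation in the log-spectral domain,
# applied independently per filterbank channel.
import math

def vts_compensate_mean(mu_x, mu_n):
    """Compensated log-spectral means for clean means mu_x and noise means mu_n."""
    return [x + math.log1p(math.exp(n - x)) for x, n in zip(mu_x, mu_n)]
```

Two sanity checks follow from the formula: with negligible noise the compensated mean equals the clean mean, and with noise at equal energy the mean rises by log 2, since the two energies add in the linear domain.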
Citations: 23
Transition features for CRF-based speech recognition and boundary detection
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373287
Spiros Dimopoulos, E. Fosler-Lussier, Chin-Hui Lee, A. Potamianos
In this paper, we investigate a variety of spectral and time domain features for explicitly modeling phonetic transitions in speech recognition. Specifically, spectral and energy distance metrics, as well as time derivatives of phonological descriptors and MFCCs, are employed. The features are integrated into an extended Conditional Random Field statistical modeling framework that supports general-purpose transition models. For evaluation purposes, we measure both phonetic recognition task accuracy and precision/recall of boundary detection. Results show that when transition features are used in a CRF-based recognition framework, recognition performance improves significantly due to the reduction of phone deletions. The boundary detection performance also improves, mainly for transitions among silence, stop, and fricative phonetic classes.
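One of the feature families above, a spectral distance metric between adjacent frames, is easy to sketch. Such a distance spikes at abrupt boundaries (e.g. silence-to-stop transitions) and stays small within steady phones, which is what makes it a useful CRF transition feature. Euclidean distance is used here purely as an illustrative choice:

```python
# Per-boundary spectral distance: Euclidean distance between each pair
# of adjacent frame feature vectors (e.g. MFCCs).
import math

def spectral_distances(frames):
    """frames: list of feature vectors; returns one distance per frame boundary."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
            for f1, f2 in zip(frames, frames[1:])]
```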
{"title":"Transition features for CRF-based speech recognition and boundary detection","authors":"Spiros Dimopoulos, E. Fosler-Lussier, Chin-Hui Lee, A. Potamianos","doi":"10.1109/ASRU.2009.5373287","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373287","url":null,"abstract":"In this paper, we investigate a variety of spectral and time domain features for explicitly modeling phonetic transitions in speech recognition. Specifically, spectral and energy distance metrics, as well as, time derivatives of phonological descriptors and MFCCs are employed. The features are integrated in an extended Conditional Random Fields statistical modeling framework that supports general-purpose transition models. For evaluation purposes, we measure both phonetic recognition task accuracy and precision/recall of boundary detection. Results show that when transition features are used in a CRF-based recognition framework, recognition performance improves significantly due to the reduction of phone deletions. The boundary detection performance also improves mainly for transitions among silence, stop, and fricative phonetic classes.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123563996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
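The abstract above mentions MFCC time derivatives and spectral distance metrics as transition features for boundary detection. A minimal sketch of two such frame-level features (function names and window settings are illustrative, not the authors' exact definitions) could be:

```python
import numpy as np

def delta_features(feats, win=2):
    """Standard regression-based time derivatives of frame features.

    feats: (T, D) array of per-frame features (e.g. MFCCs).
    Returns a (T, D) array of deltas over a +/-win frame window,
    with edge frames handled by repeating the boundary frame.
    """
    T, _ = feats.shape
    padded = np.pad(feats, ((win, win), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, win + 1))
    deltas = np.zeros_like(feats, dtype=float)
    for k in range(1, win + 1):
        deltas += k * (padded[win + k: win + k + T] - padded[win - k: win - k + T])
    return deltas / denom

def spectral_distance(feats):
    """Euclidean distance between consecutive frames; peaks in this
    trajectory tend to align with phonetic boundaries."""
    d = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    return np.concatenate([[0.0], d])  # keep alignment with frame index

# Hypothetical feature matrix: two steady regions with an abrupt change
feats = np.vstack([np.zeros((5, 3)), np.ones((5, 3))])
dist = spectral_distance(feats)
print(dist.argmax())  # prints 5: the jump sits between frames 4 and 5
```

Features like these can then be attached to the transition (rather than state) potentials of a CRF, which is the general idea the paper evaluates.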
Journal
2009 IEEE Workshop on Automatic Speech Recognition & Understanding