2009 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Using temporal information for improving articulatory-acoustic feature classification
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373314
Barbara Schuppler, Joost van Doremalen, O. Scharenborg, B. Cranen, L. Boves
This paper combines acoustic features with high temporal and high frequency resolution to reliably classify articulatory events of short duration, such as bursts in plosives. SVM classification experiments on TIMIT and SVArticulatory showed that articulatory-acoustic features (AFs) based on a combination of MFCCs derived from a long 25 ms window and a short 5 ms window, both shifted in 2.5 ms steps (Both), outperform standard MFCCs derived with a 25 ms window and a 10 ms shift (Baseline). Finally, a comparison of the TIMIT and SVArticulatory results showed that for classifiers trained on data that allows AFs to change asynchronously (SVArticulatory), the improvement from Baseline to Both is larger than for classifiers trained on data where AFs change simultaneously with the phone boundaries (TIMIT).
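To make the front end concrete, here is a minimal sketch (not the authors' implementation) of the "Both" stream: MFCCs from a 25 ms and a 5 ms analysis window, both hopped at 2.5 ms and stacked frame by frame. It assumes librosa is available and that the input file, here a hypothetical utterance.wav, is 16 kHz mono.

```python
# Minimal sketch of the dual-window "Both" features; not the authors' code.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file
hop = int(0.0025 * sr)                             # 2.5 ms hop -> 40 samples

# Long window: 25 ms (400 samples) for spectral detail.
mfcc_long = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                 n_fft=512, win_length=400, hop_length=hop)
# Short window: 5 ms (80 samples) for rapid events such as plosive bursts.
mfcc_short = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                  n_fft=512, win_length=80, hop_length=hop)

n = min(mfcc_long.shape[1], mfcc_short.shape[1])   # guard against off-by-one
both = np.vstack([mfcc_long[:, :n], mfcc_short[:, :n]])  # 26 dims per frame
```

The stacked frames would then feed one SVM per articulatory feature.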
Citations: 9
Investigations on features for log-linear acoustic models in continuous speech recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373362
Simon Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, H. Ney
Hidden Markov Models with Gaussian Mixture Models as emission probabilities (GHMMs) are the underlying structure of all state-of-the-art speech recognition systems. Using Gaussian mixture distributions follows the generative approach, where the class-conditional probability is modeled, although for classification only the posterior probability is needed. Though very successful in related tasks such as Natural Language Processing (NLP), direct modeling of posterior probabilities with log-linear models has rarely been used in speech recognition and has not been applied successfully to continuous speech recognition. In this paper we report competitive results for a speech recognizer with a log-linear acoustic model on the Wall Street Journal corpus, a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We trained this model from scratch, i.e. without relying on an existing GHMM system. The use of data-dependent sparse features for log-linear models has been proposed previously. We compare them with polynomial features and show that the combination of polynomial and data-dependent sparse features leads to better results.
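Since a log-linear acoustic model is, at its core, multinomial logistic regression over expanded acoustic features, the polynomial-feature idea can be sketched with scikit-learn. This is a toy stand-in on hypothetical frame data, not the paper's WSJ system:

```python
# Log-linear (softmax) model over polynomial feature expansions; a sketch.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))      # hypothetical MFCC frames
y = rng.integers(0, 40, size=1000)   # hypothetical HMM-state labels

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)       # adds x_i * x_j second-order terms

# Multinomial logistic regression is exactly a log-linear posterior model.
clf = LogisticRegression(max_iter=1000).fit(X_poly, y)
posteriors = clf.predict_proba(X_poly[:5])   # p(state | frame)
```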
Citations: 22
Multi-view learning of acoustic features for speaker recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373462
Karen Livescu, Mark Stoehr
We consider learning acoustic feature transformations using an additional view of the data, in this case video of the speaker's face. Specifically, we consider a scenario in which clean audio and video are available at training time, while at test time only noisy audio is available. We use canonical correlation analysis (CCA) to learn linear projections of the acoustic observations that have maximum correlation with the video frames. We provide an initial demonstration of the approach on a speaker recognition task using data from the VidTIMIT corpus. The projected features, in combination with baseline MFCCs, outperform the baseline recognizer in noisy conditions. The techniques we present are quite general, although here we apply them to a specific speaker recognition task. This is the first work of which we are aware in which multiple views are used to learn an acoustic feature projection at training time, while only the acoustics are used at test time.
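The CCA step itself is standard and can be sketched with scikit-learn. The matrices below are hypothetical stand-ins for paired clean audio and video features; this is not the authors' code:

```python
# CCA multi-view sketch: train on paired audio/video, test on audio alone.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 39))    # clean training audio (e.g. MFCCs + deltas)
V = rng.normal(size=(2000, 100))   # synchronized video features of the face

cca = CCA(n_components=20)
cca.fit(X, V)                      # maximally correlated linear projections

# Test time: only (noisy) audio is available. Apply the learned
# audio-side transform and append the projection to the baseline MFCCs.
X_test = rng.normal(size=(200, 39))
X_proj = cca.transform(X_test)     # (200, 20) correlated components
features = np.hstack([X_test, X_proj])
```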
Citations: 24
Kernel metric learning for phonetic classification
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373389
J. Huang, Xi Zhou, M. Hasegawa-Johnson, Thomas S. Huang
While a spoken sound is described by a handful of frame-level spectral vectors, not all frames contribute equally to either human perception or machine classification. In this paper, we introduce a novel framework to automatically emphasize important speech frames relevant to phonetic information. We jointly learn the importance of speech frames via a distance metric across the phone classes, attempting to satisfy a large margin constraint: the distance from a segment to its correct label class should be less than the distance to any other phone class by the largest possible margin. Furthermore, a universal background model structure is proposed to give the correspondence between statistical models of phone types and tokens, allowing us to use statistical models of each phone token in a large margin speech recognition framework. Experiments on the TIMIT database demonstrated the effectiveness of our framework.
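The large margin constraint can be illustrated with a hinge loss over segment-to-class distances. The numpy sketch below uses plain Euclidean distances to hypothetical class means rather than the paper's learned kernel metric and UBM structure:

```python
# Large-margin hinge losses over segment-to-class distances; a sketch only.
import numpy as np

def hinge_losses(seg, class_means, label, margin=1.0):
    """Per-class hinge losses for one segment under Euclidean distance."""
    d = np.linalg.norm(class_means - seg, axis=1)  # distance to each class
    viol = margin + d[label] - d                   # want d[label] + m <= d[k]
    viol[label] = 0.0                              # no constraint vs itself
    return np.maximum(0.0, viol)

rng = np.random.default_rng(0)
class_means = rng.normal(size=(48, 39))   # hypothetical phone-class models
seg = rng.normal(size=39)                 # pooled features of one segment
loss = hinge_losses(seg, class_means, label=7).sum()  # training objective term
```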
Citations: 2
Transition features for CRF-based speech recognition and boundary detection
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373287
Spiros Dimopoulos, E. Fosler-Lussier, Chin-Hui Lee, A. Potamianos
In this paper, we investigate a variety of spectral and time domain features for explicitly modeling phonetic transitions in speech recognition. Specifically, spectral and energy distance metrics, as well as time derivatives of phonological descriptors and MFCCs, are employed. The features are integrated into an extended Conditional Random Fields statistical modeling framework that supports general-purpose transition models. For evaluation purposes, we measure both phonetic recognition accuracy and the precision/recall of boundary detection. Results show that when transition features are used in a CRF-based recognition framework, recognition performance improves significantly due to the reduction of phone deletions. Boundary detection performance also improves, mainly for transitions among silence, stop, and fricative phonetic classes.
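A minimal sketch of transition-style features of the kind described, MFCC time derivatives plus a frame-to-frame spectral distance whose peaks hint at boundaries, is given below. It is a simplified stand-in on hypothetical data, not the authors' feature set or CRF:

```python
# Transition-style features: deltas plus neighbouring-frame spectral distance.
import numpy as np

def transition_features(mfcc):
    """mfcc: (T, D) frames. Returns (T, D+1) transition features."""
    deltas = np.gradient(mfcc, axis=0)               # time derivatives
    diffs = np.diff(mfcc, axis=0, prepend=mfcc[:1])  # frame-to-frame change
    spec_dist = np.linalg.norm(diffs, axis=1, keepdims=True)  # distance metric
    return np.hstack([deltas, spec_dist])

mfcc = np.random.default_rng(0).normal(size=(200, 13))  # hypothetical frames
feats = transition_features(mfcc)   # peaks in spec_dist suggest boundaries
```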
Citations: 1
Robust vocabulary independent keyword spotting with graphical models
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373544
M. Wöllmer, F. Eyben, Björn Schuller, G. Rigoll
This paper introduces a novel graphical model architecture for robust, vocabulary-independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is robust with respect to phoneme recognition errors. We use a hidden garbage variable together with the concept of switching parents to model keywords as well as arbitrary speech. This implies that keywords can be added to the vocabulary without having to re-train the model. The design of our model architecture is thereby optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the receiver operating characteristic curve. Experiments on the TIMIT corpus reveal that our graphical model outperforms a comparable hidden Markov model based keyword spotter that uses conventional garbage modelling.
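The role of the hidden garbage variable can be caricatured as a choice, per observation sequence, between a keyword model and a garbage distribution, with a prior serving as the operating-point parameter. The sketch below is a deliberately minimal stand-in for the paper's dynamic graphical model; all scores are hypothetical:

```python
# Keyword vs. garbage log-odds with an adjustable operating point; a sketch.
import numpy as np

def keyword_log_odds(ll_keyword, ll_garbage, log_prior=0.0):
    """ll_*: per-frame log-likelihoods; log_prior shifts the ROC operating point."""
    return (ll_keyword - ll_garbage).sum() + log_prior

rng = np.random.default_rng(0)
ll_kw = rng.normal(-1.0, 0.3, size=50)   # hypothetical keyword-model scores
ll_gb = rng.normal(-1.2, 0.3, size=50)   # hypothetical garbage scores
detected = keyword_log_odds(ll_kw, ll_gb, log_prior=-2.0) > 0.0
```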
Citations: 19
Discriminative adaptive training with VTS and JUD
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373266
F. Flego, M. Gales
Adaptive training is a powerful approach for building speech recognition systems on non-homogeneous training data. Recently, approaches based on predictive model-based compensation schemes, such as Joint Uncertainty Decoding (JUD) and Vector Taylor Series (VTS), have been proposed. This paper reviews these model-based compensation schemes and relates them to factor-analysis style systems. Forms of Maximum Likelihood (ML) adaptive training with these approaches are described, based on both second-order optimisation schemes and Expectation Maximisation (EM). However, discriminative training is used in many state-of-the-art speech recognition systems. Hence, this paper proposes discriminative adaptive training with predictive model-compensation approaches for noise-robust speech recognition. This training approach is applied to both JUD and VTS compensation with minimum phone error training. A large-scale multi-environment training configuration is used, and the systems are evaluated on a range of in-car collected data tasks.
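On the VTS side, the standard mismatch function for static log-mel means, ignoring the channel term, is y = x + log(1 + exp(n - x)); the Jacobian of its first-order Taylor expansion weights the clean and noise variances. The sketch below illustrates only this compensation step with hypothetical statistics, not the paper's discriminative adaptive training:

```python
# First-order VTS compensation of static log-mel Gaussian parameters; a sketch.
import numpy as np

def vts_compensate(mu_x, mu_n, var_x, var_n):
    g = np.log1p(np.exp(mu_n - mu_x))       # additive-noise mismatch term
    mu_y = mu_x + g                          # compensated mean
    J = 1.0 / (1.0 + np.exp(mu_n - mu_x))    # d mu_y / d mu_x (diagonal Jacobian)
    var_y = J**2 * var_x + (1.0 - J)**2 * var_n  # first-order variance
    return mu_y, var_y

mu_x = np.array([2.0, 1.5]); var_x = np.array([0.3, 0.2])   # clean stats
mu_n = np.array([1.0, 1.8]); var_n = np.array([0.1, 0.1])   # noise stats
mu_y, var_y = vts_compensate(mu_x, mu_n, var_x, var_n)
```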
Citations: 23
Garbage modeling with decoys for a sequential recognition scenario
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5372919
Michael Levit, Shuangyu Chang, B. Buntschuh
This paper is concerned with a speech recognition scenario in which two unequal ASR systems, one fast with constrained resources, the other significantly slower but also much more powerful, work together sequentially. In particular, we focus on deciding when to accept the results of the first recognizer and when the second recognizer needs to be consulted. As a kind of application-dependent garbage modeling, we suggest an algorithm that augments the grammar of the first recognizer with those valid paths through the language model of the second recognizer that are confusable with phrases from this grammar. We show that this algorithm outperforms a system that only looks at recognition confidences by about 20% relative.
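The sequential routing decision can be sketched as follows; the grammar, decoy phrases, and interface are hypothetical, and the point is only that a decoy hit or a low confidence score escalates the utterance to the second recognizer:

```python
# Routing between a constrained first-pass ASR and a powerful second pass.
GRAMMAR = {"check balance", "transfer funds"}   # hypothetical valid phrases
DECOYS = {"check bounce", "transfer fun"}       # confusable LM paths as decoys

def route(first_pass_text, confidence, threshold=0.6):
    """Decide whether the first-pass result can be accepted as-is."""
    if first_pass_text in DECOYS or confidence < threshold:
        return "second_recognizer"              # consult the slower, bigger ASR
    if first_pass_text in GRAMMAR:
        return "accept_first_pass"
    return "second_recognizer"                  # out-of-grammar fallback

print(route("check bounce", 0.9))   # decoy hit -> "second_recognizer"
```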
Citations: 13
Automatic selection of recognition errors by respeaking the intended text
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373347
K. Vertanen, P. Kristensson
We investigate how to automatically align spoken corrections with an initial speech recognition result. Such automatic alignment would enable one-step, voice-only correction in which users simply respeak their intended text. We present three new models for automatically aligning corrections: a 1-best model, a word confusion network model, and a revision model. The revision model allows users to alter what they intended to write even when the initial recognition was completely correct. We evaluate our models with data gathered from two user studies. We show that providing just a single correct word of context dramatically improves alignment success, from 65% to 84%. We find that a majority of users provide such context without being explicitly instructed to do so. We find that the revision model is superior when users modify words in their initial recognition, improving alignment success from 73% to 83%. We show how our models can easily incorporate prior information about the correction location, and that such information aids alignment success. Lastly, we observe that users speak their intended text faster and with fewer re-recordings than when forced to speak misrecognized text.
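A stripped-down version of the 1-best model can be built from edit-distance alignment: slide the respoken words over the hypothesis and replace the best-matching span. The sketch below is a simplification of that idea, not the paper's confusion-network or revision models:

```python
# Locate and replace a misrecognized span by edit-distance alignment; a sketch.
import numpy as np

def align_cost(hyp, corr):
    """Standard Levenshtein DP between word lists hyp and corr."""
    D = np.zeros((len(hyp) + 1, len(corr) + 1), dtype=int)
    D[:, 0] = np.arange(len(hyp) + 1)
    D[0, :] = np.arange(len(corr) + 1)
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(corr) + 1):
            sub = D[i - 1, j - 1] + (hyp[i - 1] != corr[j - 1])
            D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
    return D[-1, -1]

hyp = "the cat sad on the mat".split()   # misrecognized hypothesis
corr = "cat sat on".split()              # respoken correction
# Slide the correction over same-length windows of the hypothesis and
# replace the window with minimum edit distance.
w = len(corr)
best = min(range(len(hyp) - w + 1), key=lambda s: align_cost(hyp[s:s + w], corr))
fixed = hyp[:best] + corr + hyp[best + w:]   # -> "the cat sat on the mat"
```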
Citations: 21
Robust distributed speech recognition using two-stage Filtered Minima Controlled Recursive Averaging
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5372925
Negar Ghourchian, S. Selouani, D. O'Shaughnessy
This paper examines the use of a new Filtered Minima-Controlled Recursive Averaging (FMCRA) noise estimation technique as robust front-end processing to improve the performance of a Distributed Speech Recognition (DSR) system in noisy environments. The noisy speech is enhanced using a two-stage framework in order to simultaneously address the inefficiency of the Voice Activity Detector (VAD) and remedy the inadequacies of MCRA. The performance evaluation carried out on the Aurora 2 task showed that including FMCRA on the front-end side leads to a significant improvement in DSR accuracy.
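For orientation, plain MCRA, the basis that FMCRA filters and extends, amounts to minima-controlled recursive averaging of the noise power spectrum: track a running spectral minimum per bin, infer speech presence from the power-to-minimum ratio, and slow the noise update where speech is likely. The code below is a simplified illustration with hypothetical parameters, not the paper's FMCRA:

```python
# Simplified MCRA-style noise PSD estimation; a sketch, not FMCRA itself.
import numpy as np

def mcra_noise(power, alpha_n=0.95, win=50, ratio_thresh=5.0):
    """power: (T, F) short-time power spectra. Returns (T, F) noise PSD."""
    T, F = power.shape
    noise = np.empty((T, F))
    noise[0] = power[0]
    smin = power[0].copy()
    for t in range(1, T):
        if t % win == 0:
            smin = power[t].copy()           # restart minimum tracking
        smin = np.minimum(smin, power[t])    # running spectral minimum
        speech = power[t] / np.maximum(smin, 1e-12) > ratio_thresh
        p = speech.astype(float)             # crude speech-presence indicator
        a = alpha_n + (1.0 - alpha_n) * p    # freeze the update where speech
        noise[t] = a * noise[t - 1] + (1.0 - a) * power[t]
    return noise

power = np.abs(np.random.default_rng(0).normal(size=(200, 129))) ** 2
noise_est = mcra_noise(power)
```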
Citations: 4