
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Learning better lexical properties for recurrent OOV words
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707699
Longlu Qin, Alexander I. Rudnicky
Out-of-vocabulary (OOV) words can appear more than once in a conversation or over a period of time. Such multiple instances of the same OOV word provide valuable information for learning the lexical properties of the word. Therefore, we investigated how to estimate better pronunciation, spelling and part-of-speech (POS) labels for recurrent OOV words. We first identified recurrent OOV words from the output of a hybrid decoder by applying a bottom-up clustering approach. Then, multiple instances of the same OOV word were used simultaneously to learn properties of the OOV word. The experimental results showed that the bottom-up clustering approach is very effective at detecting the recurrence of OOV words. Furthermore, by using evidence from multiple instances of the same word, the pronunciation accuracy, recovery rate and POS label accuracy of recurrent OOV words can be substantially improved.
Citations: 4
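The bottom-up clustering of OOV instances can be sketched in a few lines. The paper does not spell out the exact algorithm, so the phone-sequence representation, the normalized edit-distance criterion, the greedy first-fit merging, and the `threshold` value below are all illustrative assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution / match
    return dp[-1]

def cluster_oov_instances(instances, threshold=0.3):
    """Greedy bottom-up clustering: an instance joins the first cluster
    whose representative is within a normalized edit-distance threshold;
    otherwise it seeds a new cluster."""
    clusters = []
    for phones in instances:
        for cluster in clusters:
            rep = cluster[0]
            dist = edit_distance(phones, rep) / max(len(phones), len(rep))
            if dist <= threshold:
                cluster.append(phones)
                break
        else:
            clusters.append([phones])
    return clusters
```

Instances whose recognized pronunciations are close end up in the same cluster, flagging them as recurrences of one OOV word whose properties can then be estimated jointly.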
Compact acoustic modeling based on acoustic manifold using a mixture of factor analyzers
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707702
Wenlin Zhang, Bi-cheng Li, Weiqiang Zhang
A compact acoustic model for speech recognition is proposed based on nonlinear manifold modeling of the acoustic feature space. Acoustic features of the speech signal are assumed to form a low-dimensional manifold, which is modeled by a mixture of factor analyzers. Each factor analyzer describes a local area of the manifold using a low-dimensional linear model. For an HMM-based speech recognition system, observations of a particular state are constrained to be located on part of the manifold, which may cover several factor analyzers. For each tied-state, a sparse weight vector is obtained through an iterative shrinkage algorithm, in which the sparseness is determined automatically by the training data. For each nonzero component of the weight vector, a low-dimensional factor is estimated for the corresponding factor model according to the maximum a posteriori (MAP) criterion, resulting in a compact state model. Experimental results show that compared with the conventional HMM-GMM system and the SGMM system, the new method not only contains fewer parameters, but also yields better recognition results.
Citations: 3
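Each factor analyzer in the mixture models its local patch of the manifold with a low-dimensional linear Gaussian whose covariance is Lambda Lambda^T + Psi, where Lambda is the d x k loading matrix and Psi a diagonal noise term. A minimal sketch of that covariance reconstruction (the dimensions and values are illustrative, not from the paper):

```python
def fa_covariance(loadings, noise_diag):
    """Covariance implied by one factor analyzer: Sigma = Lambda Lambda^T + Psi.
    `loadings` is a d x k loading matrix (list of rows), `noise_diag` the
    d diagonal entries of the noise matrix Psi."""
    d = len(loadings)
    k = len(loadings[0])
    # Rank-k part: Lambda Lambda^T
    sigma = [[sum(loadings[i][f] * loadings[j][f] for f in range(k))
              for j in range(d)] for i in range(d)]
    # Add the diagonal noise Psi
    for i in range(d):
        sigma[i][i] += noise_diag[i]
    return sigma
```

With k much smaller than d, each component needs only d*k + d parameters instead of the d*(d+1)/2 of a full covariance, which is where the compactness of the model comes from.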
Discriminative semi-supervised training for keyword search in low resource languages
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707770
Roger Hsiao, Tim Ng, F. Grézl, D. Karakos, S. Tsakalidis, L. Nguyen, R. Schwartz
In this paper, we investigate semi-supervised training for low resource languages where the initial systems may have high error rate (≥ 70.0% word error rate). To handle the lack of data, we study semi-supervised techniques including data selection, data weighting, discriminative training and multilayer perceptron learning to improve system performance. The entire suite of semi-supervised methods presented in this paper was evaluated under the IARPA Babel program for the keyword spotting tasks. Our semi-supervised system had the best performance in the OpenKWS13 surprise language evaluation for the limited condition. In this paper, we describe our work on the Turkish and Vietnamese systems.
Citations: 25
An empirical study of confusion modeling in keyword search for low resource languages
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707774
M. Saraçlar, A. Sethy, B. Ramabhadran, L. Mangu, Jia Cui, Xiaodong Cui, Brian Kingsbury, Jonathan Mamou
Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach in keyword search is to use Automatic Speech Recognition (ASR) as a front end to produce a representation of audio that can be indexed. The biggest drawback of this approach lies in its inability to deal with out-of-vocabulary words and query terms that are not in the ASR system output. In this paper we present an empirical study evaluating various approaches based on using confusion models as query expansion techniques to address this problem. We present results across four languages using a range of confusion models which lead to significant improvements in keyword search performance as measured by the Maximum Term Weighted Value (MTWV) metric.
Citations: 42
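A confusion model used for query expansion maps each unit of the query to confusable alternatives with probabilities. The sketch below, with a single-substitution expansion and a toy `confusions` table, is a simplifying assumption rather than any of the systems evaluated in the paper:

```python
def expand_query(phones, confusions, min_prob=0.05):
    """Expand a phone-sequence query into scored variants using a
    single-substitution confusion model. `confusions` maps a phone to a
    dict of {confusable_phone: probability}; variants below `min_prob`
    are pruned."""
    variants = {tuple(phones): 1.0}  # the original query, score 1.0
    for i, p in enumerate(phones):
        for alt, prob in confusions.get(p, {}).items():
            if prob >= min_prob:
                variant = tuple(phones[:i] + [alt] + phones[i + 1:])
                variants[variant] = max(variants.get(variant, 0.0), prob)
    return variants
```

Searching the index for every scored variant, instead of the literal query only, is what lets the system recover hits that the ASR front end transcribed with a confusable unit.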
Joint training of interpolated exponential n-gram models
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707700
A. Sethy, Stanley F. Chen, E. Arisoy, B. Ramabhadran, Kartik Audhkhasi, Shrikanth S. Narayanan, Paul Vozila
For many speech recognition tasks, the best language model performance is achieved by collecting text from multiple sources or domains, and interpolating language models built separately on each individual corpus. When multiple corpora are available, it has also been shown that when using a domain adaptation technique such as feature augmentation [1], the performance on each individual domain can be improved by training a joint model across all of the corpora. In this paper, we explore whether improving each domain model via joint training also improves performance when interpolating the models together. We show that the diversity of the individual models is an important consideration, and propose a method for adjusting diversity to optimize overall performance. We present results using word n-gram models and Model M, a class-based n-gram model, and demonstrate improvements in both perplexity and word-error rate relative to state-of-the-art results on a Broadcast News transcription task.
Citations: 4
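The interpolation baseline the paper builds on combines per-domain models as p(w) = sum_i lambda_i * p_i(w), with interpolation weights summing to one. A unigram sketch with toy probabilities (real systems interpolate full n-gram models):

```python
import math

def interpolate(models, weights):
    """Return a function p(w) = sum_i weights[i] * models[i][w],
    treating unseen words as probability 0 in that component."""
    def prob(word):
        return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))
    return prob

def perplexity(prob, words):
    # PPL = exp(-(1/N) * sum_w log p(w))
    return math.exp(-sum(math.log(prob(w)) for w in words) / len(words))
```

For example, a 0.5/0.5 mix of two toy unigram domains {"the": 0.5, "cat": 0.5} and {"the": 0.5, "dog": 0.5} gives p("the") = 0.5 and p("cat") = 0.25; the weights are normally tuned to minimize perplexity on held-out data.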
The TAO of ATWV: Probing the mysteries of keyword search performance
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707728
S. Wegmann, Arlo Faria, Adam L. Janin, K. Riedhammer, N. Morgan
In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search systems. Our analysis uses two new oracle ATWV measures, a bootstrap-based ATWV confidence interval, and includes a study of the underpinnings of the large ATWV gains due to system combination. This analysis quantifies the potential ATWV gains from improving the number of true hits and the overall quality of the detection scores in our system's posting lists. It also shows that system combination improves our systems' ATWV via a small increase in the number of true hits in the posting lists.
Citations: 47
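ATWV averages a term-weighted cost over all query terms: ATWV = 1 - mean over terms of [P_miss(term) + beta * P_FA(term)], with beta = 999.9 and the number of non-target trials for a term approximated as one trial per second of speech minus the true occurrences. A minimal sketch of the metric:

```python
def atwv(term_stats, speech_seconds, beta=999.9):
    """Actual Term-Weighted Value as used in the NIST/Babel KWS evaluations.
    term_stats: list of (n_true, n_correct, n_false_alarm) per query term.
    Non-target trials per term are approximated as speech_seconds - n_true."""
    total = 0.0
    for n_true, n_correct, n_fa in term_stats:
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
    return 1.0 - total / len(term_stats)
```

The large beta is why a single extra false alarm can cost as much as tenths of a point of P_miss: the metric is deliberately skewed toward precision, which is what makes the detection-score quality in the posting lists matter so much.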
Hierarchical neural networks and enhanced class posteriors for social signal classification
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707757
Raymond Brueckner, Björn Schuller
With the impressive advances of deep learning in recent years, interest in neural networks has resurged in the fields of automatic speech recognition and emotion recognition. In this paper we apply neural networks to address speaker-independent detection and classification of laughter and filler vocalizations in speech. We first explore modeling class posteriors with standard neural networks and deep stacked autoencoders. Then, we adopt a hierarchical neural architecture to compute enhanced class posteriors and demonstrate that this approach introduces significant and consistent improvements on the Social Signals Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE). On this task we achieve an unweighted average area-under-the-curve of 92.4%, the official competition measure, on the test set. This constitutes an improvement of 9.1% over the baseline and is the best result obtained so far on this task.
Citations: 15
Deep maxout networks for low-resource speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707763
Yajie Miao, Florian Metze, Shourabh Rawat
As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
Citations: 99
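A maxout unit outputs the maximum over k affine "pieces" of its input, which is what lets dropout compose with it naturally. A minimal sketch of a single unit (the weights and inputs are illustrative):

```python
def maxout(inputs, weight_groups, biases):
    """A single maxout unit: the max over k affine pieces of the input.
    weight_groups[k][i] is the weight from input i to piece k; biases[k]
    is that piece's bias."""
    return max(
        sum(w * x for w, x in zip(weights, inputs)) + b
        for weights, b in zip(weight_groups, biases)
    )
```

Because the unit's output is whichever affine piece is largest, the activation is piecewise linear with no saturation, and the max tends to drive many pieces toward irrelevance, which is consistent with the sparse hidden activations the paper reports.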
Semi-supervised bootstrapping approach for neural network feature extractor training
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707775
F. Grézl, M. Karafiát
This paper presents a bootstrapping approach for neural network training. The neural networks serve as bottleneck feature extractors for a subsequent GMM-HMM recognizer. The recognizer is also used for transcription and confidence assignment of untranscribed data. Based on the confidence, segments are selected, mixed with supervised data, and new NNs are trained. With this approach, it is possible to recover 40-55% of the difference between partially and fully transcribed data (a 3 to 5% absolute improvement over a NN trained on supervised data only). Using the 70-85% of automatically transcribed segments with the highest confidence was found optimal to achieve this result.
Citations: 54
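The confidence-based selection step follows directly from the numbers in the abstract; the dictionary segment format below is an assumption for illustration:

```python
def select_segments(segments, fraction=0.8):
    """Keep the `fraction` of automatically transcribed segments with the
    highest recognizer confidence. The paper found keeping 70-85% of the
    segments to be optimal, hence the 0.8 default here."""
    ranked = sorted(segments, key=lambda s: s["confidence"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

The kept segments, paired with their automatic transcripts, are then mixed with the supervised data to train the next bottleneck feature extractor.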
Probabilistic lexical modeling and unsupervised training for zero-resourced ASR
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707771
Ramya Rasipuram, Marzieh Razavi, M. Magimai.-Doss
Standard automatic speech recognition (ASR) systems rely on transcribed speech, language models, and pronunciation dictionaries to achieve state-of-the-art performance. The unavailability of these resources limits the availability of ASR technology for many languages. In this paper, we propose a novel zero-resourced ASR approach to train acoustic models that only uses a list of probable words from the language of interest. The proposed approach is based on the Kullback-Leibler divergence based hidden Markov model (KL-HMM), grapheme subword units, knowledge of grapheme-to-phoneme mapping, and graphemic constraints derived from the word list. The approach also exploits existing acoustic and lexical resources available in other resource-rich languages. Furthermore, we propose unsupervised adaptation of KL-HMM acoustic model parameters if untranscribed speech data in the target language is available. We demonstrate the potential of the proposed approach through a simulated study on the Greek language.
Citations: 7
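The KL-HMM local score is the Kullback-Leibler divergence between a state's trained posterior distribution over subword units and the posterior estimated for the current frame. A minimal sketch (the epsilon smoothing is an implementation assumption to guard against zeros):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two posterior distributions over subword units,
    the kind of local score a KL-HMM state uses in place of a Gaussian
    likelihood."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

A frame whose estimated posterior matches a state's trained posterior scores near zero divergence, so decoding prefers paths through states whose posterior "templates" match the observed frames.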