
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Acoustic modeling using transform-based phone-cluster adaptive training
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707704
Vimal Manohar, S. C. Bhargav, S. Umesh
In this paper, we propose a new acoustic modeling technique called Phone-Cluster Adaptive Training. In this approach, the parameters of context-dependent states are obtained by linear interpolation of several monophone cluster models, which are themselves obtained by adaptation using linear transformations of a canonical Gaussian Mixture Model (GMM). This approach is inspired by Cluster Adaptive Training (CAT) for speaker adaptation and by the Subspace Gaussian Mixture Model (SGMM). The parameters of the model are updated in an adaptive training framework. The interpolation vectors implicitly capture phonetic context information. The proposed approach shows substantial improvement over the Continuous Density Hidden Markov Model (CDHMM) and performance similar to that of the SGMM, while using significantly fewer parameters than both.
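A minimal numpy sketch of how the state-level Gaussian means could be assembled in this scheme: cluster models as affine transforms of a canonical GMM, and context-dependent states as interpolations over the clusters. All sizes and variable names are illustrative, and the real system estimates transforms and interpolation vectors by adaptive training rather than sampling them.

```python
import numpy as np

# Hypothetical sizes: D-dim features, I Gaussians in the canonical GMM,
# K monophone clusters, J context-dependent states.
D, I, K, J = 39, 64, 40, 1000
rng = np.random.default_rng(0)

canonical_means = rng.standard_normal((I, D))            # canonical GMM means
cluster_transforms = rng.standard_normal((K, D, D + 1))  # one affine transform per cluster

# Cluster models: affine transforms of the canonical means (adaptation step).
ext = np.hstack([canonical_means, np.ones((I, 1))])      # append bias term -> (I, D+1)
cluster_means = np.einsum('kde,ie->kid', cluster_transforms, ext)

# Context-dependent states: linear interpolation over the cluster models;
# the interpolation vectors are what implicitly encode phonetic context.
v = rng.dirichlet(np.ones(K), size=J)                    # (J, K) interpolation vectors
state_means = np.einsum('jk,kid->jid', v, cluster_means) # (J, I, D) state-specific means
```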
Citations: 7
Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707759
Liang Lu, Arnab Ghoshal, S. Renals
Speech recognition systems normally use handcrafted pronunciation lexicons designed by linguistic experts. Building and maintaining such a lexicon is expensive and time-consuming. This paper concerns automatically learning a pronunciation lexicon for speech recognition. We assume the availability of a small seed lexicon and then learn the pronunciations of new words directly from speech that is transcribed at the word level. We present two implementations for refining the putative pronunciations of new words based on acoustic evidence. The first is an expectation maximization (EM) algorithm based on weighted finite state transducers (WFSTs); the other is its Viterbi approximation. We carried out experiments on the Switchboard corpus of conversational telephone speech. The expert lexicon has a size of more than 30,000 words, from which we randomly selected 5,000 words to form the seed lexicon. Using the proposed lexicon learning method, we significantly improved accuracy compared with a lexicon learned using a grapheme-to-phoneme transformation, and obtained a word error rate approaching that achieved with a fully handcrafted lexicon.
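In simplified form, the EM refinement re-weights each word's candidate pronunciations by how well they explain the acoustics of the observed word tokens. A sketch under the assumption that per-pronunciation acoustic likelihoods are available from a decoder; in the paper they come from WFST composition, and `acoustic_score` below is a stub.

```python
from collections import defaultdict

def em_update(lexicon, utterances, acoustic_score):
    """One EM iteration over candidate pronunciations.

    lexicon: dict word -> dict pron -> prob (the putative lexicon)
    utterances: list of (word, features) word-level training tokens
    acoustic_score: callable (features, pron) -> likelihood p(X | pron)
    """
    counts = defaultdict(lambda: defaultdict(float))
    for word, feats in utterances:
        prons = lexicon[word]
        # E-step: posterior of each candidate pronunciation given the acoustics.
        joint = {p: prob * acoustic_score(feats, p) for p, prob in prons.items()}
        z = sum(joint.values()) or 1.0
        for p, s in joint.items():
            counts[word][p] += s / z
    # M-step: renormalise accumulated posteriors into new pronunciation probabilities.
    for word, c in counts.items():
        total = sum(c.values())
        lexicon[word] = {p: n / total for p, n in c.items()}
    return lexicon
```

The Viterbi approximation mentioned in the abstract would replace the soft posterior in the E-step with a hard assignment of each token to its single best-scoring pronunciation.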
Citations: 46
On-line adaptation of semantic models for spoken language understanding
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707711
Ali Orkan Bayer, G. Riccardi
Spoken language understanding (SLU) systems extract semantic information from speech signals, which is usually mapped onto concept sequences. The distribution of concepts in dialogues is usually sparse. Therefore, general models may fail to model the concept distribution for a dialogue, and semantic models can benefit from adaptation. In this paper, we present an instance-based approach for on-line adaptation of semantic models. We show that we can improve the performance of an SLU system on an utterance by retrieving relevant instances from the training data and using them for on-line adaptation of the semantic models. The instance-based adaptation scheme uses two different similarity metrics, edit distance and n-gram match score, on three different tokenizations: word-concept pairs, words, and concepts. We achieved a significant improvement (6% relative) in understanding performance by conducting rescoring experiments on the n-best lists that the SLU outputs. We also applied a two-level adaptation scheme, where adaptation is first applied to the automatic speech recognizer (ASR) and then to the SLU.
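To illustrate the retrieval step, here is a minimal sketch of the two similarity metrics over token sequences (the tokens may be words, concepts, or word-concept pairs, per the three tokenizations). The retrieval function and its parameters are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def ngram_match(a, b, n=2):
    """Fraction of n-grams of sequence `a` that also occur in `b`."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga), 1)

def retrieve(utterance, training_instances, k=10):
    """Return the k training instances most similar to the test utterance,
    here ranked by edit distance alone for simplicity."""
    return sorted(training_instances, key=lambda t: edit_distance(utterance, t))[:k]
```

The retrieved instances would then be used to re-estimate (adapt) the semantic model before rescoring the SLU's n-best lists.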
Citations: 6
K-component recurrent neural network language models using curriculum learning
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707696
Yangyang Shi, M. Larson, C. Jonker
Conventional n-gram language models are known for their limited ability to capture long-distance dependencies and their brittleness with respect to within-domain variations. In this paper, we propose a k-component recurrent neural network language model using curriculum learning (CL-KRNNLM) to address within-domain variations. Based on a Dutch-language corpus, we investigate three methods of curriculum learning that exploit dedicated component models for specific sub-domains. Under an oracle situation in which context information is known during testing, we experimentally test three hypotheses. The first is that domain-dedicated models perform better than general models on their specific domains. The second is that curriculum learning can be used to train recurrent neural network language models (RNNLMs) from general patterns to specific patterns. The third is that curriculum learning, used as an implicit weighting method to adjust the relative contributions of general and specific patterns, outperforms conventional linear interpolation. Under the condition that context information is unknown during testing, the CL-KRNNLM also improves word prediction accuracy by 13% relative over a conventional RNNLM. Finally, the CL-KRNNLM is tested in an additional experiment involving N-best rescoring on a standard data set. Here, the context domains are created by clustering the training data using Latent Dirichlet Allocation and k-means clustering.
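To make the contrast concrete, here is a sketch of the linear-interpolation baseline alongside a curriculum-style training schedule. Both are highly simplified; the paper's three curriculum methods are more elaborate, and the domain names below are placeholders.

```python
def interpolate(component_probs, weights):
    """Conventional linear interpolation baseline:
    P(w | h) = sum_k lambda_k * P_k(w | h), with the lambdas summing to 1."""
    return sum(lam * p for lam, p in zip(weights, component_probs))

def curriculum(batches_by_domain, order=("general", "specific")):
    """Curriculum schedule: present general-domain batches first, then
    domain-specific ones, so training moves from general to specific patterns."""
    for domain in order:
        for batch in batches_by_domain[domain]:
            yield batch
```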
Citations: 8
Using web text to improve keyword spotting in speech
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707768
Ankur Gandhe, Longlu Qin, Florian Metze, Alexander I. Rudnicky, Ian Lane, Matthias Eck
For low resource languages, collecting sufficient training data to build acoustic and language models is time consuming and often expensive. But large amounts of text data, such as online newspapers, web forums or online encyclopedias, usually exist for languages that have a large population of native speakers. This text data can be easily collected from the web and then used to both expand the recognizer's vocabulary and improve the language model. One challenge, however, is normalizing and filtering the web data for a specific task. In this paper, we investigate the use of online text resources to improve the performance of speech recognition, specifically for the task of keyword spotting. For the five languages provided in the base period of the IARPA BABEL project, we automatically collected text data from the web using only LimitedLP resources. We then compared two methods for filtering the web data, one based on perplexity ranking and the other based on out-of-vocabulary (OOV) word detection. By integrating the web text into our systems, we observed significant improvements in keyword spotting accuracy for four out of the five languages. The best approach obtained an improvement in actual term-weighted value (ATWV) of 0.0424 compared to a baseline system trained only on LimitedLP resources. On average, ATWV improved by 0.0243 across the five languages.
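A minimal sketch of the filtering idea, combining both criteria in one pass: rank or gate web sentences by perplexity under a seed language model and by OOV rate against the seed vocabulary. The thresholds are illustrative, `lm_logprob` is a hypothetical stub returning a total natural-log probability, and the paper evaluates the two criteria separately.

```python
import math

def filter_web_text(sentences, lm_logprob, vocab,
                    ppl_threshold=500.0, oov_threshold=0.2):
    """Keep web sentences that look in-domain: low perplexity under a seed
    LM and a low out-of-vocabulary rate against the seed vocabulary."""
    kept = []
    for sent in sentences:
        words = sent.split()
        if not words:
            continue
        ppl = math.exp(-lm_logprob(words) / len(words))   # per-word perplexity
        oov = sum(w not in vocab for w in words) / len(words)
        if ppl < ppl_threshold and oov < oov_threshold:
            kept.append(sent)
    return kept
```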
Citations: 20
Combination of data borrowing strategies for low-resource LVCSR
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707764
Y. Qian, Kai Yu, Jia Liu
Large vocabulary continuous speech recognition (LVCSR) is particularly difficult for low-resource languages, where only very limited manually transcribed data are available. However, it is often feasible to obtain a large amount of untranscribed data in the low-resource target language, or sufficient transcribed data in some non-target languages. Borrowing data from these additional sources to help LVCSR for low-resource languages has become an important research direction. This paper presents an integrated data borrowing framework for this scenario. Three data borrowing approaches were first investigated in detail, at the feature, model, and data-corpus levels. They borrow data at different levels from additional sources, and all yield substantial performance improvements. As these strategies work independently, the obtained gains are likely additive. The three strategies are then combined to form an integrated data borrowing framework. Experiments showed that the integrated framework achieved a significant improvement of more than 10% absolute WER reduction over a conventional baseline. In particular, the gain under the extremely limited low-resource scenario is 16%.
Citations: 10
Combining stochastic average gradient and Hessian-free optimization for sequence training of deep neural networks
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707750
Pierre L. Dognin, V. Goel
Minimum phone error (MPE) training of deep neural networks (DNNs) is an effective technique for reducing the word error rate of automatic speech recognition tasks. This training is often carried out using a Hessian-free (HF) quasi-Newton approach, although other methods such as stochastic gradient descent have also been applied successfully. In this paper we present a novel stochastic approach to HF sequence training inspired by the recently proposed stochastic average gradient (SAG) method. SAG reuses gradient information from past updates, and consequently simulates the presence of more training data than is actually observed for each model update. We extend SAG by dynamically weighting the contribution of previous gradients, and by combining it with a stochastic HF optimization. We term the resulting procedure DSAG-HF. Experimental results for training DNNs on 1500 hours of audio data show that, compared to baseline HF training, DSAG-HF leads to better held-out MPE loss after each model parameter update, and converges to an overall better loss value. Furthermore, since each update in DSAG-HF takes place over a smaller amount of data, the procedure converges in about half the time of baseline HF sequence training.
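The SAG idea can be sketched in its plain-gradient form: cache the most recent gradient seen for each batch and update the model with a weighted average over all cached gradients, so each step reflects more data than the current batch alone. The paper embeds this inside HF optimization; the exponential decay used for the dynamic weighting below is an assumption for illustration, not the paper's exact scheme.

```python
def dsag_step(theta, batch_id, grad, memory, lr=0.01, decay=0.9):
    """One dynamically weighted stochastic-average-gradient update (sketch).

    theta: current parameters (numpy array)
    batch_id: identifier of the mini-batch just processed
    grad: gradient computed on that mini-batch
    memory: dict persisting cached gradients and their weights across calls
    """
    # Down-weight all previously stored gradients, then cache the new one.
    memory['scale'] = {k: v * decay for k, v in memory.get('scale', {}).items()}
    memory.setdefault('grads', {})[batch_id] = grad
    memory['scale'][batch_id] = 1.0
    # Weighted average over every cached gradient simulates seeing more data
    # than the current batch alone provides.
    num = sum(memory['scale'][k] * g for k, g in memory['grads'].items())
    den = sum(memory['scale'].values())
    return theta - lr * num / den
```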
Citations: 8
Lightly supervised automatic subtitling of weather forecasts
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707772
Joris Driesen, S. Renals
Since subtitling television content is a costly process, there are large potential advantages to automating it using automatic speech recognition (ASR). However, training the necessary acoustic models can be a challenge, since the available training data usually lack verbatim orthographic transcriptions. If approximate transcriptions are available, this problem can be overcome using light supervision methods. In this paper, we perform speech recognition on broadcasts of Weatherview, the BBC's daily weather report, as a first step towards automatic subtitling. For training, we use a large set of past broadcasts, using their manually created subtitles as approximate transcriptions. We discuss and compare two different light supervision methods, applying them to this data. The best training set obtained with these methods is used to create a hybrid deep neural network-based recognition system, which yields high recognition accuracies on three separate Weatherview evaluation sets.
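One common light-supervision scheme, stated here as an illustrative assumption rather than the paper's exact method: decode each segment with a seed recognizer, align the hypothesis against the approximate subtitle, and keep only segments where the two agree closely enough for the subtitle to serve as a training label.

```python
from difflib import SequenceMatcher

def agreement(hyp, sub):
    """Fraction of matching words between an ASR hypothesis and a subtitle,
    both given as lists of words."""
    return SequenceMatcher(None, hyp, sub).ratio()

def select_segments(segments, threshold=0.8):
    """Keep (audio, subtitle) pairs whose seed-model hypothesis agrees
    closely with the subtitle; the threshold is illustrative."""
    return [(audio, sub) for audio, hyp, sub in segments
            if agreement(hyp, sub) >= threshold]
```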
Citations: 14
Elastic spectral distortion for low resource speech recognition with deep neural networks
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707748
Naoyuki Kanda, Ryu Takeda, Y. Obuchi
An acoustic model based on hidden Markov models with deep neural networks (DNN-HMM) has recently been proposed and achieves high recognition accuracy. In this paper, we investigated an elastic spectral distortion method to artificially augment training samples, helping DNN-HMMs acquire enough robustness even when only a limited number of training samples is available. We investigated three distortion methods: vocal tract length distortion, speech rate distortion, and frequency-axis random distortion, and evaluated them on Japanese lecture recordings. In a large vocabulary continuous speech recognition task with only 10 hours of training samples, a DNN-HMM trained with the elastic spectral distortion method achieved a 10.1% relative word error reduction compared with a normally trained DNN-HMM.
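A minimal sketch of a frequency-axis distortion: resample each frame's spectral envelope along a randomly warped frequency axis. The linear warp and its factor range are illustrative assumptions; the paper's vocal tract length and speech rate distortions follow the same resampling idea along different axes.

```python
import numpy as np

def warp_frequency_axis(spectrogram, alpha=None, rng=None):
    """Apply a random linear frequency warp to a (frames, bins) spectrogram,
    producing a distorted copy for training-data augmentation."""
    rng = rng or np.random.default_rng()
    if alpha is None:
        alpha = rng.uniform(0.9, 1.1)            # warp factor drawn per utterance
    n_frames, n_bins = spectrogram.shape
    src = np.clip(np.arange(n_bins) * alpha, 0, n_bins - 1)
    warped = np.empty_like(spectrogram)
    for t in range(n_frames):
        # Resample the frame's spectral envelope at the warped positions.
        warped[t] = np.interp(src, np.arange(n_bins), spectrogram[t])
    return warped
```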
Citations: 112
Porting concepts from DNNs back to GMMs
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707756
Kris Demuynck, Fabian Triefenbach
Deep neural networks (DNNs) have been shown to outperform Gaussian Mixture Models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models), and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
Citations: 9