
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Convolutional neural network based triangular CRF for joint intent detection and slot filling
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707709
Puyang Xu, R. Sarikaya
We describe a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture can be perceived as a neural network (NN) version of the triangular CRF model (TriCRF), in which the intent label and the slot sequence are modeled jointly and their dependencies are exploited. Our slot filling component is a globally normalized CRF style model, as opposed to left-to-right models in recent NN based slot taggers. Its features are automatically extracted through CNN layers and shared by the intent model. We show that our slot model component generates state-of-the-art results, outperforming CRF significantly. Our joint model outperforms the standard TriCRF by 1% absolute for both intent and slot. On a number of other domains, our joint model achieves 0.7-1%, and 0.9-2.1% absolute gains over the independent modeling approach for intent and slot respectively.
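For readers new to TriCRF-style joint modeling, a minimal numpy sketch of the scoring idea follows: shared token features (here random stand-ins for CNN outputs) feed a sentence-level intent term, per-token slot emissions, slot transitions, and an intent-slot compatibility term that ties the two predictions together. All names and numbers are illustrative, not the paper's implementation.

```python
# Illustrative TriCRF-style joint score: shared (e.g., CNN-derived) features feed
# both the intent term and the slot CRF, plus an intent-slot compatibility term.
import numpy as np

def joint_score(feat, W_intent, W_slot, trans, compat, intent, slots):
    """Unnormalized log-score of one (intent, slot sequence) hypothesis.

    feat:     (T, D) shared per-token features (assumed CNN outputs)
    W_intent: (K, D) intent weights;  W_slot: (L, D) slot emission weights
    trans:    (L, L) slot transition scores;  compat: (K, L) intent-slot scores
    """
    intent_score = W_intent[intent] @ feat.mean(axis=0)            # sentence-level intent term
    emit = sum(W_slot[s] @ feat[t] for t, s in enumerate(slots))   # per-token slot emissions
    bigram = sum(trans[slots[t - 1], slots[t]] for t in range(1, len(slots)))
    tie = sum(compat[intent, s] for s in slots)                    # joint intent-slot dependency
    return intent_score + emit + bigram + tie

# Toy usage: 4 tokens, 8-dim features, 3 intents, 5 slot labels.
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8))
print(joint_score(feat, rng.normal(size=(3, 8)), rng.normal(size=(5, 8)),
                  rng.normal(size=(5, 5)), rng.normal(size=(3, 5)),
                  intent=1, slots=[0, 2, 2, 4]))
```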
Citations: 316
Mixture of mixture n-gram language models
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707701
H. Sak, Cyril Allauzen, Kaisuke Nakajima, F. Beaufays
This paper presents a language model adaptation technique to build a single static language model from a set of language models each trained on a separate text corpus while aiming to maximize the likelihood of an adaptation data set given as a development set of sentences. The proposed model can be considered as a mixture of mixture language models. The mixture model at the top level is a sentence-level mixture model where each sentence is assumed to be drawn from one of a discrete set of topic or task clusters. After selecting a cluster, each n-gram is assumed to be drawn from one of the given n-gram language models. We estimate cluster mixture weights and n-gram language model mixture weights for each cluster using the expectation-maximization (EM) algorithm to seek the parameter estimates maximizing the likelihood of the development sentences. This mixture of mixture models can be represented efficiently as a static n-gram language model using the previously proposed Bayesian language model interpolation technique. We show a significant improvement with this technique (both perplexity and WER) compared to the standard one level interpolation scheme.
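A minimal EM sketch of the top-level estimation step is given below: given the likelihood of each development sentence under each cluster's language model, the sentence-level cluster priors are re-estimated. The data and variable names are illustrative; the per-cluster component n-gram weights would be updated analogously and are omitted for brevity.

```python
# EM for the sentence-level cluster mixture weights, with component LMs held fixed.
import numpy as np

def em_cluster_weights(sent_lik, n_iter=20):
    """sent_lik: (S, C) likelihood of development sentence s under cluster c."""
    S, C = sent_lik.shape
    w = np.full(C, 1.0 / C)                      # uniform initial cluster weights
    for _ in range(n_iter):
        post = w * sent_lik                      # E-step: unnormalized responsibilities
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)                    # M-step: new cluster priors
    return w

# Toy example: 3 development sentences, 2 topic/task clusters.
print(em_cluster_weights(np.array([[1e-5, 4e-5], [2e-5, 1e-6], [3e-5, 5e-5]])))
```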
Citations: 2
Towards unsupervised semantic retrieval of spoken content with query expansion based on automatically discovered acoustic patterns
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707729
Yun-Chiao Li, Hung-yi Lee, Cheng-Tao Chung, Chun-an Chan, Lin-Shan Lee
This paper presents an initial effort to retrieve semantically related spoken content in a completely unsupervised way. Unsupervised approaches to spoken content retrieval are attractive because they bypass the need for annotated data reasonably matched to the spoken content for training acoustic and language models. However, almost all such unsupervised approaches focus on spoken term detection, i.e., returning the spoken segments containing the query, using either template matching techniques such as dynamic time warping (DTW) or model-based approaches. Users, on the other hand, usually prefer to retrieve all objects semantically related to the query, not necessarily those containing the query terms. This paper proposes a different approach. We transcribe the spoken segments in the archive to be retrieved into sequences of acoustic patterns automatically discovered by an unsupervised method. For an input query in spoken form, the top-N spoken segments obtained from the archive in a first-pass DTW retrieval are taken as pseudo-relevant. Acoustic patterns frequently occurring in these segments are then considered query-related and used for query expansion. Preliminary experiments on Mandarin broadcast news offered very encouraging results.
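A rough sketch of the two-pass idea is shown below: rank archive segments by DTW distance to the spoken query, treat the top-N as pseudo-relevant, and take the acoustic pattern IDs occurring most often in them as expansion terms. The DTW, features, and pattern sequences here are toy stand-ins, not the paper's setup.

```python
import numpy as np
from collections import Counter

def dtw(a, b):
    """Plain DTW distance between two (T, D) feature sequences."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def expand_query(query_feat, segments, pattern_seqs, n_top=2, n_expand=3):
    order = np.argsort([dtw(query_feat, s) for s in segments])       # first-pass DTW ranking
    counts = Counter(p for idx in order[:n_top] for p in pattern_seqs[idx])
    return [p for p, _ in counts.most_common(n_expand)]              # frequent patterns as expansion terms

rng = np.random.default_rng(1)
segs = [rng.normal(size=(20, 4)) for _ in range(5)]                  # toy archive segments
patterns = [[1, 3, 3, 7], [2, 3, 5], [3, 7, 7], [4, 4], [1, 2, 3]]   # toy discovered-pattern sequences
print(expand_query(rng.normal(size=(15, 4)), segs, patterns))
```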
Citations: 10
Automatic sentiment extraction from YouTube videos
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707736
L. Kaushik, A. Sangwan, J. Hansen
Extracting speaker sentiment from natural audio streams such as YouTube is challenging. A number of factors contribute to the task difficulty, namely, Automatic Speech Recognition (ASR) of spontaneous speech, unknown background environments, variable source and channel characteristics, accents, diverse topics, etc. In this study, we build upon our previous work [5], where we had proposed a system for detecting sentiment in YouTube videos. Particularly, we propose several enhancements including (i) better text-based sentiment model due to training on larger and more diverse dataset, (ii) an iterative scheme to reduce sentiment model complexity with minimal impact on performance accuracy, (iii) better speech recognition due to superior acoustic modeling and focused (domain dependent) vocabulary/language models, and (iv) a larger evaluation dataset. Collectively, our enhancements provide an absolute 10% improvement over our previous system in terms of sentiment detection accuracy. Additionally, we also present analysis that helps understand the impact of WER (word error rate) on sentiment detection accuracy. Finally, we investigate the relative importance of different Parts-of-Speech (POS) tag features towards sentiment detection. Our analysis reveals the practicality of this technology and also provides several potential directions for future work.
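To illustrate only the shape of the pipeline, a toy sketch follows: an ASR hypothesis (here a fixed string standing in for recognizer output) is scored by a text-based sentiment model (here a tiny lexicon; the paper trains a far richer model on larger data).

```python
# Toy lexicon-based sentiment scoring of an ASR transcript; names are illustrative.
POS = {"great", "good", "love", "excellent"}
NEG = {"bad", "terrible", "hate", "poor"}

def sentiment_score(transcript: str) -> float:
    words = transcript.lower().split()
    hits = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return hits / max(len(words), 1)          # normalize by utterance length

asr_hypothesis = "i love this phone the camera is great but battery is bad"
print(sentiment_score(asr_hypothesis))        # > 0 indicates net-positive sentiment
```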
Citations: 28
Dysfluent speech detection by image forensics techniques
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707712
Juraj Pálfy, Sakhia Darjaa, Jiri Pospíchal
As speech recognition has become popular, the importance of dysfluency detection has increased considerably. Once a dysfluent event in spontaneous speech is identified, speech recognition performance can be improved by eliminating its negative effect. Most existing techniques to detect such dysfluent events are based on statistical models. The sparse, irregular occurrence of dysfluent events and the complexity of describing them within a speech recognition system make their recognition difficult. These problems are addressed by our algorithm, which is inspired by image forensics. This paper presents the algorithm we developed to extract novel features of complex dysfluencies. The common steps of classifier design were used to statistically evaluate the proposed features of complex dysfluencies in the spectral and cepstral domains. Support vector machines perform an objective assessment of MFCC features, MFCC-based derived features, PCA-based derived features, and kernel-PCA-based derived features of complex dysfluencies; our derived features improved performance by 46% relative to MFCC.
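A sketch of the comparison protocol, assuming synthetic data in place of real MFCC extraction: the same SVM classifier is evaluated by cross-validation on raw MFCC-style features and on PCA and kernel-PCA projections of them.

```python
# SVM evaluation of raw vs. PCA vs. kernel-PCA features; data is synthetic stand-in.
import numpy as np
from sklearn.svm import SVC
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))                 # stand-in for 13-dim MFCC vectors
y = rng.integers(0, 2, size=200)               # toy labels: 1 = dysfluent, 0 = fluent

variants = {
    "mfcc": X,
    "pca": PCA(n_components=8).fit_transform(X),
    "kpca": KernelPCA(n_components=8, kernel="rbf").fit_transform(X),
}
for name, feats in variants.items():
    acc = cross_val_score(SVC(), feats, y, cv=5).mean()   # objective assessment via CV accuracy
    print(f"{name}: {acc:.3f}")
```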
Citations: 1
Hybrid speech recognition with Deep Bidirectional LSTM
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707742
Alex Graves, N. Jaitly, Abdel-rahman Mohamed
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
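The core mechanic of any NN-HMM hybrid, DBLSTM included, is converting per-frame state posteriors into scaled likelihoods by dividing out the state priors before HMM decoding. A minimal numpy sketch of that step (with random posteriors standing in for network outputs) is:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """posteriors: (T, S) softmax outputs; priors: (S,) state priors from alignments."""
    return np.log(posteriors + eps) - np.log(priors + eps)   # log p(x|s) up to a constant

T, S = 3, 4
post = np.random.dirichlet(np.ones(S), size=T)               # stand-in for DBLSTM outputs
priors = np.array([0.4, 0.3, 0.2, 0.1])
print(scaled_log_likelihoods(post, priors))
```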
Citations: 1517
Barge-in effects in Bayesian dialogue act recognition and simulation
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707713
H. Cuayáhuitl, Nina Dethlefs, H. Hastie, Oliver Lemon
Dialogue act recognition and simulation are traditionally considered separate processes. Here, we argue that both can be fruitfully treated as interleaved processes within the same probabilistic model, leading to a synchronous improvement of performance in both. To demonstrate this, we train multiple Bayes Nets that predict the timing and content of the next user utterance. A specific focus is on providing support for barge-ins. We describe experiments using the Let's Go data that show an improvement in classification accuracy (+5%) in Bayesian dialogue act recognition involving barge-ins using partial context compared to using full context. Our results also indicate that simulated dialogues with user barge-in are more realistic than simulations without barge-in events.
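To make the partial-context idea concrete, here is a toy naive-Bayes-style sketch of predicting the next user dialogue act from discrete context features, where "partial context" simply means unobserved features are skipped during scoring. The acts, features, and probability tables are invented for illustration and are not the paper's Bayes nets.

```python
import numpy as np

acts = ["inform", "confirm", "barge_in"]
prior = {"inform": 0.5, "confirm": 0.3, "barge_in": 0.2}
cpt = {  # P(feature value | act), one column per act, toy numbers
    "sys_act": {"ask": [0.6, 0.3, 0.4], "offer": [0.4, 0.7, 0.6]},
    "sys_speaking": {"yes": [0.2, 0.3, 0.9], "no": [0.8, 0.7, 0.1]},
}

def predict(context):
    scores = np.array([prior[a] for a in acts])
    for feat, val in context.items():          # use only the features actually observed
        scores *= np.array(cpt[feat][val])
    return acts[int(np.argmax(scores))], scores / scores.sum()

print(predict({"sys_speaking": "yes"}))                    # partial context
print(predict({"sys_act": "ask", "sys_speaking": "no"}))   # full context
```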
Citations: 6
Vector Taylor series based HMM adaptation for generalized cepstrum in noisy environment
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707727
Soonho Baek, Hong-Goo Kang
This paper proposes a novel HMM adaptation algorithm for robust automatic speech recognition (ASR) in noisy environments. HMM adaptation using vector Taylor series (VTS) significantly improves ASR performance in noisy environments. Recently, the power normalized cepstral coefficient (PNCC), which replaces the logarithmic mapping function with a power mapping function, has been proposed, and the replacement has been shown to be robust to additive noise. In this paper, we extend the VTS-based approach to cepstral coefficients obtained with a power mapping function instead of a logarithmic mapping function. Experimental results indicate that HMM adaptation in the cepstrum obtained with a power mapping function improves ASR performance compared to the conventional VTS-based approach for mel-frequency cepstral coefficients (MFCCs).
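For context, a minimal numpy sketch of the conventional log-domain first-order VTS adaptation that the paper generalizes: with additive noise, y = x + log(1 + exp(n - x)), linearized around the clean and noise means. The power-mapping extension itself is not reproduced here; the numbers are toy values.

```python
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))        # Jacobian dy/dx at the expansion point
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))  # adapted (noisy-speech) mean
    var_y = g**2 * var_x + (1.0 - g)**2 * var_n  # first-order adapted variance
    return mu_y, var_y

mu_x, var_x = np.array([2.0, 1.0]), np.array([0.5, 0.4])   # clean log-energies (toy)
mu_n, var_n = np.array([1.0, 1.5]), np.array([0.3, 0.3])   # noise log-energies (toy)
print(vts_adapt(mu_x, var_x, mu_n, var_n))
```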
Citations: 1
Improving robustness of deep neural networks via spectral masking for automatic speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707743
Bo Li, K. Sim
The performance of human listeners degrades rather slowly compared to machines in noisy environments. This has been attributed to the ability to perform auditory scene analysis, which separates the speech prior to recognition. In this work, we investigate two mask estimation approaches, namely state-dependent and deep neural network (DNN) based estimation, to separate speech from noise and improve the noise robustness of DNN acoustic models. The second approach has been experimentally shown to outperform the first. Due to the stereo-data-based training and the ill-defined masks for speech with channel distortions, neither method generalizes well to unseen conditions, and neither beats the performance of the multi-style trained baseline system. However, the model trained on masked features is strongly complementary to the baseline model. A simple average of the two systems' posteriors yields word error rates of 4.4% on Aurora2 and 12.3% on Aurora4.
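Two ingredients of the approach in toy numpy form, under the assumption of a ratio-style mask: apply an estimated mask to the noisy spectral features before the acoustic model, and average the per-frame posteriors of the masked-feature system and the baseline system. Shapes and values are illustrative only.

```python
import numpy as np

def apply_mask(noisy_spec, mask):
    return mask * noisy_spec                     # element-wise speech-presence mask

def combine_posteriors(p_baseline, p_masked):
    return 0.5 * (p_baseline + p_masked)         # simple average of the two systems

T, F, S = 4, 5, 3
noisy = np.abs(np.random.randn(T, F))
mask = np.clip(np.random.rand(T, F), 0.0, 1.0)  # stand-in for a DNN-estimated mask
masked = apply_mask(noisy, mask)
p1 = np.random.dirichlet(np.ones(S), size=T)    # baseline system posteriors (toy)
p2 = np.random.dirichlet(np.ones(S), size=T)    # masked-feature system posteriors (toy)
print(combine_posteriors(p1, p2).sum(axis=1))   # rows still sum to 1
```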
Citations: 40
Impact of deep MLP architecture on different acoustic modeling techniques for under-resourced speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707752
David Imseng, P. Motlícek, Philip N. Garner, H. Bourlard
Posterior based acoustic modeling techniques such as Kullback-Leibler divergence based HMM (KL-HMM) and Tandem are able to exploit out-of-language data through posterior features, estimated by a Multi-Layer Perceptron (MLP). In this paper, we investigate the performance of posterior based approaches in the context of under-resourced speech recognition when a standard three-layer MLP is replaced by a deeper five-layer MLP. The deeper MLP architecture yields similar gains of about 15% (relative) for Tandem, KL-HMM as well as for a hybrid HMM/MLP system that directly uses the posterior estimates as emission probabilities. The best performing system, a bilingual KL-HMM based on a deep MLP, jointly trained on Afrikaans and Dutch data, performs 13% better than a hybrid system using the same bilingual MLP and 26% better than a subspace Gaussian mixture system only trained on Afrikaans data.
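For readers unfamiliar with KL-HMM, the local frame cost can be sketched in a few lines: each HMM state stores a categorical distribution over the MLP's posterior dimensions, and a frame is scored by the KL divergence between that state distribution and the frame's posterior vector. The sketch below shows one common KL direction with toy values; variants exist.

```python
import numpy as np

def kl_local_cost(state_dist, frame_post, eps=1e-10):
    """KL(state || frame) used as the per-frame emission cost in a KL-HMM."""
    s, z = state_dist + eps, frame_post + eps
    return float(np.sum(s * np.log(s / z)))

state = np.array([0.7, 0.2, 0.1])               # learned state-level posterior distribution
frame = np.array([0.6, 0.3, 0.1])               # MLP posterior vector for one frame
print(kl_local_cost(state, frame))
```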
Citations: 28