
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Context-dependent modelling of deep neural network using logistic regression
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707753
Guangsen Wang, K. Sim
The data sparsity problem of context-dependent acoustic modelling in automatic speech recognition is addressed by using the decision tree state clusters as the training targets in standard context-dependent (CD) deep neural network (DNN) systems. As a result, the CD states within a cluster cannot be distinguished during decoding. This problem, referred to as the clustering problem, is not explicitly addressed in the current literature. In this paper, we formulate the CD DNN as an instance of the canonical state modelling technique based on a set of broad phone classes to address both the data sparsity and the clustering problems. Each triphone is clustered into multiple sets of shorter biphones using broad phone contexts to address the data sparsity issue, and a DNN is trained to discriminate the biphones within each set. The canonical states are represented by the concatenated log posteriors of all the broad phone DNNs. Logistic regression is used to transform the canonical states into the triphone state output probability. Clustering of the regression parameters reduces model complexity while still achieving unique acoustic scores for all possible triphones. Experimental results on a broadcast news transcription task reveal that the proposed regression-based CD DNN significantly outperforms the standard CD DNN; the best system provides a 2.7% absolute WER reduction compared to the best standard CD DNN system.
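To make the final mapping concrete, here is a minimal sketch, assuming the broad-phone DNNs are already trained: a frame's canonical state representation is the concatenation of their log posteriors, and a multinomial logistic regression (softmax) layer with hypothetical weights `W`, `b` maps it to triphone-state posteriors. The shapes and names are illustrative, not the authors' implementation.

```python
import numpy as np

def triphone_posteriors(broad_log_posts, W, b):
    """Map concatenated broad-phone DNN log posteriors (the canonical
    state representation) to triphone-state posteriors via a
    multinomial logistic regression layer."""
    x = np.concatenate(broad_log_posts)   # canonical state vector
    logits = W @ x + b                    # one logit per triphone state
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy usage: two broad-phone DNNs with 4 and 3 biphone targets,
# mapped to 5 triphone states.
rng = np.random.default_rng(0)
posts = [np.log(rng.dirichlet(np.ones(4))), np.log(rng.dirichlet(np.ones(3)))]
W, b = rng.normal(size=(5, 7)), np.zeros(5)
print(triphone_posteriors(posts, W, b))
```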
Citations: 3
Models of tone for tonal and non-tonal languages
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707740
Florian Metze, Zaid A. W. Sheikh, A. Waibel, Jonas Gehring, Kevin Kilgour, Quoc Bao Nguyen, V. Nguyen
Conventional wisdom in automatic speech recognition asserts that pitch information is not helpful in building speech recognizers for non-tonal languages and contributes only modestly to performance in speech recognizers for tonal languages. To maintain consistency between different systems, pitch is therefore often ignored, trading the slight performance benefits for greater system uniformity/simplicity. In this paper, we report results that challenge this conventional approach. We present new models of tone that deliver consistent performance improvements for tonal languages (Cantonese, Vietnamese) and even modest improvements for non-tonal languages. Using neural networks for feature integration and fusion, these models achieve significant gains throughout, and provide us with system uniformity and standardization across all languages, tonal and non-tonal.
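As an illustration of neural-network feature fusion of this kind, the sketch below appends frame-level pitch features (F0 and a probability-of-voicing value) to spectral features so that a single network input carries both streams. This is a common recipe, not necessarily the exact feature set used in the paper, and all array names are hypothetical.

```python
import numpy as np

def add_pitch_features(fbank, f0, pov):
    """Append frame-level pitch features (F0, probability of voicing)
    to filterbank features so one network can fuse both streams.
    fbank: (T, D) spectral features; f0, pov: length-T arrays."""
    f0 = (f0 - f0.mean()) / (f0.std() + 1e-8)  # per-utterance normalization
    pitch = np.stack([f0, pov], axis=1)        # (T, 2) pitch stream
    return np.hstack([fbank, pitch])           # (T, D + 2) network input

# Toy usage with random "features" for a 100-frame utterance.
rng = np.random.default_rng(0)
x = add_pitch_features(rng.normal(size=(100, 40)),
                       rng.uniform(80.0, 300.0, size=100),
                       rng.uniform(0.0, 1.0, size=100))
print(x.shape)  # (100, 42)
```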
Citations: 50
Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707758
H. Liao, E. McDermott, A. Senior
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009, YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data, and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context-dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence-trained DNN results for this task.
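A minimal sketch of the island-of-confidence idea, under the assumption that it amounts to keeping only the stretches where the recognizer's hypothesis and the owner-uploaded transcript agree for several consecutive words; the minimum island length and the use of `difflib` here are illustrative choices, not the paper's exact criterion.

```python
import difflib

def confidence_islands(hyp_words, ref_words, min_len=3):
    """Return word spans where the ASR hypothesis and the uploaded
    transcript agree for at least `min_len` consecutive words; such
    "islands" are kept as semi-supervised training segments."""
    m = difflib.SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    return [hyp_words[blk.a:blk.a + blk.size]
            for blk in m.get_matching_blocks()
            if blk.size >= min_len]

hyp = "the cat sat on a mat while it rained".split()
ref = "the cat sat on the mat while it rained".split()
print(confidence_islands(hyp, ref))
# [['the', 'cat', 'sat', 'on'], ['mat', 'while', 'it', 'rained']]
```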
Citations: 192
Score normalization and system combination for improved keyword spotting
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707731
D. Karakos, R. Schwartz, S. Tsakalidis, Le Zhang, Shivesh Ranjan, Tim Ng, Roger Hsiao, G. Saikumar, I. Bulyko, L. Nguyen, J. Makhoul, F. Grézl, M. Hannemann, M. Karafiát, Igor Szöke, Karel Veselý, L. Lamel, V. Le
We present two techniques that are shown to yield improved Keyword Spotting (KWS) performance under the ATWV/MTWV performance measures: (i) score normalization, where the scores of different keywords become commensurate with each other and correspond more closely to the probability of being correct than raw posteriors do; and (ii) system combination, where the detections of multiple systems are merged together and their scores are interpolated with weights optimized using MTWV as the maximization criterion. Both the score normalization and system combination approaches show that significant gains in ATWV/MTWV can be obtained, sometimes on the order of 8-10 points (absolute), in five different languages. A variant of these methods resulted in the highest performance in the official surprise language evaluation for the IARPA-funded Babel project in April 2013.
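One widely used form of keyword-specific score normalization, sketched below, raises each raw posterior to a power and rescales so that each keyword's detection scores sum to one, which boosts the scores of rare keywords before a single global threshold is applied. The exponent `gamma` and the exact normalization rule are assumptions for illustration; the paper's normalization may differ in detail.

```python
from collections import defaultdict

def sum_to_one_normalize(detections, gamma=1.0):
    """Keyword-specific score normalization: raise each raw posterior
    to the power gamma and rescale so every keyword's detection
    scores sum to one. detections: (keyword, time, score) tuples."""
    totals = defaultdict(float)
    for kw, _, s in detections:
        totals[kw] += s ** gamma
    return [(kw, t, (s ** gamma) / totals[kw]) for kw, t, s in detections]

dets = [("apple", 1.2, 0.9), ("apple", 7.5, 0.3), ("rare term", 4.0, 0.2)]
print(sum_to_one_normalize(dets))
# The lone "rare term" hit is boosted to 1.0; "apple" hits are rescaled.
```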
Citations: 104
Improved punctuation recovery through combination of multiple speech streams
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707718
João Miranda, J. Neto, A. Black
In this paper, we present a technique that uses the information in multiple parallel speech streams, which are approximate translations of each other, to improve performance in a punctuation recovery task. We first build a phrase-level alignment of these multiple streams, using phrase tables to link the phrase pairs together. The information so collected is then used to make it more likely that sentence units are equivalent across streams. We applied this technique to a number of simultaneously interpreted speeches of the European Parliament Committees, for the recovery of the full stop, in four different languages (English, Italian, Portuguese and Spanish). We observed an average improvement in SER of 37% when compared to an existing baseline, in Portuguese and English.
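Once the streams are phrase-aligned, a simple way to exploit them is to fuse per-stream sentence-boundary posteriors at aligned positions, as in the sketch below; the weighted-average fusion and the threshold are illustrative assumptions, not the paper's exact model.

```python
def fuse_boundary_scores(stream_scores, weights=None, threshold=0.5):
    """Fuse full-stop posteriors from several aligned speech streams by
    weighted averaging; hypothesize a boundary wherever the fused
    score clears the threshold. stream_scores: one score list per
    stream, index i meaning the same candidate boundary everywhere."""
    n = len(stream_scores)
    weights = weights or [1.0 / n] * n
    fused = [sum(w * s[i] for w, s in zip(weights, stream_scores))
             for i in range(len(stream_scores[0]))]
    return [i for i, f in enumerate(fused) if f >= threshold]

# Three interpreted streams, four aligned candidate boundaries.
print(fuse_boundary_scores([[0.9, 0.2, 0.7, 0.1],
                            [0.8, 0.3, 0.4, 0.2],
                            [0.7, 0.1, 0.6, 0.1]]))  # -> [0, 2]
```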
Citations: 4
Unsupervised word segmentation from noisy input
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707773
Jahn Heymann, Oliver Walter, Reinhold Häb-Umbach, B. Raj
In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice at the input rather than a single string accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. Recently, a Weighted Finite State Transducer (WFST) based approach was published which we show suffers from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue greatly improves precision and recall rates, though at the cost of increased computational complexity; it is therefore practical only for single input strings. To allow for a lattice input, and thus for errors in the character/phoneme recognizer, we propose a computationally efficient suboptimal two-stage approach, which is shown to significantly improve the word segmentation performance compared to the earlier WFST approach.
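As a reference point for what segmentation means here, the sketch below handles the simpler single-string case: a Viterbi dynamic program segments an unsegmented character string under a unigram word model. The toy lexicon and probabilities are hypothetical; the paper's method additionally learns the lexicon and operates over lattices.

```python
import math

def segment(chars, word_logp, max_len=10):
    """Viterbi segmentation of a character string under a unigram word
    model: best[i] is the best log probability of the first i
    characters; back[i] remembers the last word boundary."""
    n = len(chars)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = chars[j:i]
            if w in word_logp and best[j] + word_logp[w] > best[i]:
                best[i], back[i] = best[j] + word_logp[w], j
    words, i = [], n
    while i > 0:                 # trace back the best boundary sequence
        words.append(chars[back[i]:i])
        i = back[i]
    return words[::-1]

lex = {"the": math.log(0.4), "cat": math.log(0.3),
       "at": math.log(0.2), "c": math.log(0.1)}
print(segment("thecat", lex))  # ['the', 'cat'] beats ['the', 'c', 'at']
```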
Citations: 24
Query understanding enhanced by hierarchical parsing structures
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707708
Jingjing Liu, Panupong Pasupat, Yining Wang, D. S. Cyphers, James R. Glass
Query understanding has been well studied in the areas of information retrieval and spoken language understanding (SLU). There are generally three layers of query understanding: domain classification, user intent detection, and semantic tagging. Classifiers can be applied to domain and intent detection in real systems, and semantic tagging (or slot filling) is commonly defined as a sequence-labeling task - mapping a sequence of words to a sequence of labels. Various statistical features (e.g., n-grams) can be extracted from annotated queries for learning label prediction models; however, linguistic characteristics of queries, such as hierarchical structures and semantic relationships, are usually neglected in the feature extraction process. In this work, we propose an approach that leverages linguistic knowledge encoded in hierarchical parse trees for query understanding. Specifically, for natural language queries, we extract a set of syntactic structural features and semantic dependency features from query parse trees to enhance inference model learning. Experiments on real natural language queries show that augmenting sequence labeling models with linguistic knowledge can improve query understanding performance in various domains.
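The sketch below illustrates one way such parse-derived knowledge can enter a sequence labeler: each token's feature dictionary combines surface n-gram features with dependency features (head word, relation, and their conjunction) read off a parse. The feature templates and the tiny hand-built parse are hypothetical, chosen only to show the idea.

```python
def token_features(tokens, heads, rels, i):
    """Feature dict for token i mixing surface n-gram features with
    dependency features from a parse. tokens: words; heads: head
    index per token (-1 = root); rels: dependency relation labels."""
    head = tokens[heads[i]] if heads[i] >= 0 else "<root>"
    return {
        "word=" + tokens[i]: 1.0,
        "prev=" + (tokens[i - 1] if i > 0 else "<s>"): 1.0,
        "next=" + (tokens[i + 1] if i + 1 < len(tokens) else "</s>"): 1.0,
        # features below encode the hierarchical parse structure
        "rel=" + rels[i]: 1.0,
        "head=" + head: 1.0,
        "rel+head=" + rels[i] + "_" + head: 1.0,
    }

# Tiny hand-built dependency parse of "show flights to boston".
toks = ["show", "flights", "to", "boston"]
heads, rels = [-1, 0, 1, 2], ["root", "dobj", "prep", "pobj"]
print(token_features(toks, heads, rels, 3))  # features for "boston"
```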
Citations: 70
Efficient nearly error-less LVCSR decoding based on incremental forward and backward passes
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707707
David Nolden, R. Schlüter, H. Ney
We show that most search errors can be identified by aligning the results of symmetric forward and backward decoding passes. Based on this observation, we introduce an efficient high-level decoding architecture that yields virtually no search errors and requires virtually no manual tuning. We perform an initial forward and backward decoding with tight initial beams, identify search errors, and then recursively increment the beam sizes and perform new forward and backward decodings for the erroneous intervals until no more search errors are detected. Consequently, each utterance, and even each single word, is decoded with the smallest beam size required to decode it correctly. On all tested systems we achieve an error rate equal or very close to that of classical decoding with an ideally tuned beam size, but automatically, without specific tuning, and at around twice the speed. An additional speedup by a factor of 2 can be achieved by decoding the forward and backward passes in separate threads.
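The control loop behind this architecture might look like the sketch below: decode in both directions with a tight beam, treat any interval where the two hypotheses disagree as a suspected search error, and re-decode with a wider beam until the passes agree. `decode` is an assumed external decoder (here faked by a toy stub), and the beam schedule is illustrative.

```python
import difflib

def decode_nearly_errorless(utt, decode, beam=8.0, step=4.0, max_beam=32.0):
    """Forward/backward decoding with incremental beams: disagreement
    between the two passes marks suspected search errors, which are
    re-decoded with larger beams until the hypotheses agree.
    `decode(utt, beam, direction, interval)` returns the full word
    sequence with the given interval re-decoded at the given beam."""
    interval = None  # None = decode the whole utterance
    while True:
        fwd = decode(utt, beam, "forward", interval)
        bwd = decode(utt, beam, "backward", interval)
        if fwd == bwd or beam >= max_beam:
            return fwd  # agreement, or beam budget exhausted
        m = difflib.SequenceMatcher(a=fwd, b=bwd, autojunk=False)
        diffs = [op for op in m.get_opcodes() if op[0] != "equal"]
        interval = (diffs[0][1], diffs[-1][2])  # word span in disagreement
        beam += step

# Toy stand-in decoder: the passes disagree until the beam is wide enough.
def toy_decode(utt, beam, direction, interval):
    words = utt.split()
    if beam >= 12:
        return words
    return words[:-1] + (["uh"] if direction == "forward" else ["um"])

print(decode_nearly_errorless("the cat sat down", toy_decode))
```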
Citations: 2
Accelerating recurrent neural network training via two stage classes and parallelization
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707751
Zhiheng Huang, G. Zweig, Michael Levit, Benoît Dumoulin, Barlas Oğuz, Shawn Chang
Recurrent neural network (RNN) language models have proven successful at lowering the perplexity and word error rate in automatic speech recognition (ASR). However, one challenge in adopting RNN language models is their heavy computational cost in training. In this paper, we propose two techniques to accelerate RNN training: 1) two-stage class RNNs and 2) parallel RNN training. In experiments on a Microsoft internal short message dictation (SMD) data set, two-stage class RNNs and parallel RNNs not only result in equal or lower WERs compared to original RNNs but also accelerate training by 2 and 10 times, respectively. It is worth noting that the two-stage class RNN speedup also applies at test time, which is essential for reducing latency in real-time ASR applications.
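For context, the class-based factorization that a two-stage class RNN extends replaces one |V|-sized softmax with a chain of small softmaxes: P(w|h) = P(super|h) * P(class|super,h) * P(w|class,h). The sketch below computes that factored probability from a hidden state; the parameter layout and the toy hierarchy are assumptions made purely for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_prob(h, word, word2class, class2super, params):
    """Two-stage class factorization of an RNN LM output:
    P(w|h) = P(super|h) * P(class|super,h) * P(w|class,h).
    `params` maps each softmax node to (weight matrix, member list)."""
    c = word2class[word]
    s = class2super[c]
    W_s, supers = params["super"]
    W_c, classes = params[("classes_of", s)]
    W_w, words = params[("words_of", c)]
    return (softmax(W_s @ h)[supers.index(s)] *
            softmax(W_c @ h)[classes.index(c)] *
            softmax(W_w @ h)[words.index(word)])

# Toy hierarchy: 2 superclasses, classes under "S0", words under "animals".
rng = np.random.default_rng(0)
h = rng.normal(size=4)
params = {
    "super": (rng.normal(size=(2, 4)), ["S0", "S1"]),
    ("classes_of", "S0"): (rng.normal(size=(2, 4)), ["animals", "verbs"]),
    ("words_of", "animals"): (rng.normal(size=(2, 4)), ["cat", "dog"]),
}
print(word_prob(h, "cat", {"cat": "animals", "dog": "animals"},
                {"animals": "S0", "verbs": "S0"}, params))
```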
Citations: 31
Using proxies for OOV keywords in the keyword search task
Pub Date: 2013-12-01 DOI: 10.1109/ASRU.2013.6707766
Guoguo Chen, Oguz Yilmaz, J. Trmal, Daniel Povey, S. Khudanpur
We propose a simple but effective weighted finite state transducer (WFST) based framework for handling out-of-vocabulary (OOV) keywords in a speech search task. State-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) systems are developed for conversational telephone speech in Tagalog. Word-based and phone-based indexes are created from word lattices, the latter by using the LVCSR system's pronunciation lexicon. Pronunciations of OOV keywords are hypothesized via a standard grapheme-to-phoneme method. In-vocabulary proxies (word or phone sequences) are generated for each OOV keyword using WFST techniques that permit incorporation of a phone confusion matrix. Empirical results when searching for the Babel/NIST evaluation keywords in the Babel 10 hour development-test speech collection show that (i) searching for word proxies in the word index significantly outperforms searching for phonetic representations of OOV words in a phone index, and (ii) while phone confusion information yields minor improvement when searching a phone index, it yields up to 40% improvement in actual term weighted value when searching a word index with word proxies.
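The paper generates proxies by WFST composition with a phone confusion transducer; the sketch below approximates the same idea without an FST toolkit, ranking in-vocabulary words by phone edit distance to the hypothesized OOV pronunciation. Uniform edit costs stand in for the confusion-matrix weights, and the tiny lexicon is hypothetical.

```python
def edit_distance(a, b):
    """Phone-level Levenshtein distance with uniform costs (a stand-in
    for the weighted costs a phone confusion matrix would supply)."""
    d = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, pb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (pa != pb))  # substitution
    return d[-1]

def proxies(oov_pron, lexicon, k=3):
    """Rank in-vocabulary words by phone edit distance to the
    hypothesized OOV pronunciation; the k best serve as proxies."""
    ranked = sorted(lexicon.items(),
                    key=lambda kv: edit_distance(oov_pron, kv[1]))
    return [w for w, _ in ranked[:k]]

lex = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"],
       "bat": ["b", "ae", "t"], "dog": ["d", "ao", "g"]}
print(proxies(["k", "ae", "p"], lex, k=2))  # ['cat', 'cab']
```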
Citations: 100