
Latest publications: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707716
Yun-Nung (Vivian) Chen, William Yang Wang, Alexander I. Rudnicky
Spoken dialogue systems typically use predefined semantic slots to parse users' natural language inputs into unified semantic representations. Defining the slots usually involves domain experts and professional annotators, which can be costly. In this paper, we ask the following question: given a collection of unlabeled raw audio, can we use frame semantics theory to automatically induce and fill the semantic slots in an unsupervised fashion? To do this, we propose the use of a state-of-the-art frame-semantic parser, and a spectral clustering based slot ranking model that adapts the generic output of the parser to the target semantic space. Empirical experiments on a real-world spoken dialogue dataset show that the automatically induced semantic slots are in line with the reference slots created by domain experts: we observe a mean average precision of 69.36% using ASR-transcribed data. Our slot-filling evaluations also indicate the promising future of the proposed approach.
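A minimal sketch of the spectral-clustering idea behind such a slot ranking model, not the authors' exact system: candidate slots are represented as feature vectors (illustrative here), an affinity graph is built from cosine similarity, and clustering is done in the embedding spanned by the smallest eigenvectors of the normalized graph Laplacian. The vector dimensionality, affinity choice, and number of clusters are all assumptions for illustration.

```python
import numpy as np

def spectral_clusters(X, k):
    """Cluster rows of X via spectral embedding of a cosine-similarity graph."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(Xn @ Xn.T, 0.0, None)      # non-negative affinity matrix
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(X)) - Dinv @ A @ Dinv   # symmetric normalized Laplacian
    _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    U = vecs[:, :k]                        # embed into k smallest eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Deterministic k-means: farthest-point init, then Lloyd iterations.
    centers = [U[0]]
    for _ in range(1, k):
        d2 = np.min([np.square(U - c).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(50):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

In the paper's setting the rows would be semantic-frame candidates induced by the parser; here they are toy vectors.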
Citations: 88
DNN acoustic modeling with modular multi-lingual feature extraction networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707754
Jonas Gehring, Quoc Bao Nguyen, Florian Metze, A. Waibel
In this work, we propose several deep neural network architectures that are able to leverage data from multiple languages. Modularity is achieved by training networks for extracting high-level features and for estimating phoneme state posteriors separately, and then combining them for decoding in a hybrid DNN/HMM setup. This approach has been shown to achieve superior performance for single-language systems, and here we demonstrate that feature extractors benefit significantly from being trained as multi-lingual networks with shared hidden representations. We also show that existing mono-lingual networks can be re-used in a modular fashion to achieve a similar level of performance without having to train new networks on multi-lingual data. Furthermore, we investigate extending these architectures to make use of language-specific acoustic features. Evaluations are performed on a low-resource conversational telephone speech transcription task in Vietnamese, while additional data for acoustic model training is provided in Pashto, Tagalog, Turkish, and Cantonese. Improvements of up to 17.4% and 13.8% over mono-lingual GMMs and DNNs, respectively, are obtained.
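The modular idea above can be sketched as two stacked feed-forward blocks: a shared feature extractor followed by a language-specific state classifier. This is an illustrative toy with random (untrained) weights; the layer sizes, 40-dim input features, and 120 output states are assumptions, not the paper's configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Module:
    """A small feed-forward block; in practice the weights would be trained."""
    def __init__(self, sizes, rng):
        self.layers = [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
                       for i, o in zip(sizes, sizes[1:])]

    def forward(self, x, last_linear=False):
        for k, (W, b) in enumerate(self.layers):
            x = x @ W + b
            if not (last_linear and k == len(self.layers) - 1):
                x = relu(x)
        return x

rng = np.random.default_rng(0)
feature_net = Module([40, 64, 32], rng)   # shared multi-lingual feature extractor
classifier = Module([32, 64, 120], rng)   # per-language phoneme-state classifier
frames = rng.standard_normal((5, 40))     # 5 frames of 40-dim acoustic features
feats = feature_net.forward(frames)       # high-level features, reusable per language
posteriors = softmax(classifier.forward(feats, last_linear=True))
```

Swapping in a different `classifier` while keeping `feature_net` fixed mirrors the re-use of mono-lingual networks described above.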
Citations: 12
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707765
Keith D. Levin, Katharine Henry, A. Jansen, Karen Livescu
Measures of acoustic similarity between words or other units are critical for segmental exemplar-based acoustic models, spoken term discovery, and query-by-example search. Dynamic time warping (DTW) alignment cost has been the most commonly used measure, but it has well-known inadequacies. Some recently proposed alternatives require large amounts of training data. In the interest of finding more efficient, accurate, and low-resource alternatives, we consider the problem of embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances (such as cosine or Euclidean) serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities. Such embeddings would enable efficient audio indexing and permit application of standard distance learning techniques to segmental acoustic modeling. In this paper, we explore several supervised and unsupervised approaches to this problem and evaluate them on an acoustic word discrimination task. We identify several embedding algorithms that match or improve upon the DTW baseline in low-resource settings.
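The DTW baseline the paper compares against can be sketched as follows: a standard dynamic-programming alignment between two variable-length frame sequences, with Euclidean local cost and length normalization. The normalization choice is one common convention, assumed here for illustration.

```python
import numpy as np

def dtw_cost(x, y):
    """Length-normalized DTW alignment cost between frame sequences
    x of shape (n, d) and y of shape (m, d)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = np.linalg.norm(x[i - 1] - y[j - 1])       # local Euclidean cost
            D[i, j] = c + min(D[i - 1, j],                # insertion
                              D[i, j - 1],                # deletion
                              D[i - 1, j - 1])            # match
    return D[n, m] / (n + m)
```

A fixed-dimensional embedding replaces this O(nm) alignment with a single cosine or Euclidean distance between two vectors, which is what makes efficient indexing possible.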
Citations: 120
Semi-supervised training of Deep Neural Networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707741
Karel Veselý, M. Hannemann, L. Burget
In this paper we search for an optimal strategy for semi-supervised Deep Neural Network (DNN) training. We assume that a small part of the data is transcribed, while the majority of the data is untranscribed. We explore self-training strategies with data selection based on both the utterance-level and frame-level confidences. Further on, we study the interactions between semi-supervised frame-discriminative training and sequence-discriminative sMBR training. We found it beneficial to reduce the disproportion in the amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to do a frame selection based on per-frame confidences derived from confusion in a lattice. For the experiments, we used the Limited language pack condition for the Surprise language task (Vietnamese) from the IARPA Babel program. The absolute Word Error Rate (WER) improvement for frame cross-entropy training is 2.2%, which corresponds to a WER recovery of 36% compared to an identical system where the DNN is built on the fully transcribed data.
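The two data-balancing tricks described above (repeating the transcribed data and keeping only high-confidence automatically labeled frames) can be sketched as a simple assembly function. The threshold and repeat count here are illustrative assumptions, not the values tuned in the paper.

```python
def build_training_set(transcribed, untranscribed, conf_threshold=0.9, repeats=3):
    """Assemble a semi-supervised training set: repeat the transcribed frames
    to offset the much larger untranscribed pool, and keep only automatically
    labeled frames whose lattice-derived confidence clears the threshold.

    transcribed:   list of (frame, label) pairs
    untranscribed: list of (frame, hypothesized_label, confidence) triples
    """
    selected = [(frame, label)
                for frame, label, conf in untranscribed
                if conf >= conf_threshold]
    return transcribed * repeats + selected
```

In a real pipeline the confidences would come from lattice posteriors produced by the seed system's decoder.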
Citations: 137
Effective pseudo-relevance feedback for language modeling in speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707698
Berlin Chen, Yi-Wen Chen, Kuan-Yu Chen, E. Jan
Language modeling (LM) is part and parcel of any automatic speech recognition (ASR) system: it helps to constrain the acoustic analysis, guide the search through multiple candidate word strings, and quantify the acceptability of the final output hypothesis given an input utterance. Despite the fact that the n-gram model remains the predominant one, a number of novel and ingenious LM methods have been developed to complement or be used in place of the n-gram model. A more recent line of research is to leverage information cues gleaned from pseudo-relevance feedback (PRF) to derive an utterance-regularized language model for complementing the n-gram model. This paper presents a continuation of this general line of research and its main contribution is two-fold. First, we explore an alternative and more efficient formulation to construct such an utterance-regularized language model for ASR. Second, the utilities of various utterance-regularized language models are analyzed and compared extensively. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task demonstrate that our proposed language models can offer substantial improvements over the baseline n-gram system, and achieve performance competitive to, or better than, some state-of-the-art language models.
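A minimal sketch of the pseudo-relevance-feedback idea, not the paper's exact formulation: rank corpus utterances by word overlap with the first-pass hypothesis, estimate a unigram model from the top-k "pseudo-relevant" utterances, and interpolate it with a background model. The overlap ranking, unigram order, and interpolation weight are all illustrative assumptions.

```python
from collections import Counter

def prf_language_model(first_pass_hyp, corpus, k=2, alpha=0.5):
    """Return a word-probability function built from pseudo-relevance feedback.

    first_pass_hyp: list of words from the first recognition pass
    corpus:         list of utterances (each a list of words) to draw feedback from
    """
    q = set(first_pass_hyp)
    # Rank utterances by overlap with the hypothesis; keep the top k as feedback.
    feedback = sorted(corpus, key=lambda u: len(q & set(u)), reverse=True)[:k]
    counts = Counter(w for u in feedback for w in u)
    total = sum(counts.values())

    def prob(word, background):
        # Interpolate the feedback unigram with a background model.
        return alpha * counts[word] / total + (1 - alpha) * background.get(word, 0.0)

    return prob
```

The interpolated model can then be used to rescore hypotheses in a second pass.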
Citations: 0
Semantic entity detection from multiple ASR hypotheses within the WFST framework
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707710
J. Svec, P. Ircing, L. Smídl
The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach the semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFST) - the ASR lattices are nowadays frequently represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can be first expressed in the form of a context free grammar and then converted to the FST form. We use a WFST optimization to obtain compact representation of the ASR lattice. The WFST framework also allows the use of word confusion networks as another representation of multiple ASR hypotheses. That way we can use the full power of composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this approach allows us to overcome the need for a filler model and consequently makes the method more general.
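To illustrate the central operation, here is a toy composition of two epsilon-free WFSTs over the tropical semiring (weights add along matched arcs), standing in for the OpenFST composition the paper actually uses. The machine encoding, lattice, and tag labels are made up for the example.

```python
from collections import deque

def compose(A, B):
    """Compose two epsilon-free WFSTs over the tropical semiring.

    Each machine is (start_state, final_states, arcs) with
    arcs = [(src, in_label, out_label, weight, dst)].
    """
    start = (A[0], B[0])
    finals, arcs_out = set(), []
    seen, queue = {start}, deque([start])
    while queue:
        qa, qb = queue.popleft()
        if qa in A[1] and qb in B[1]:
            finals.add((qa, qb))
        for sa, ia, oa, wa, da in A[2]:
            if sa != qa:
                continue
            for sb, ib, ob, wb, db in B[2]:
                # Match A's output label against B's input label.
                if sb != qb or ib != oa:
                    continue
                dst = (da, db)
                arcs_out.append(((qa, qb), ia, ob, wa + wb, dst))
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return start, finals, arcs_out

# A tiny "lattice" acceptor for "new york" composed with a word-to-tag transducer.
lattice = (0, {2}, [(0, "new", "new", 0.1, 1), (1, "york", "york", 0.2, 2)])
tagger = (0, {0}, [(0, "new", "<city>", 0.0, 0), (0, "york", "<city>", 0.0, 0)])
result = compose(lattice, tagger)
```

Real systems would use OpenFST's optimized composition with epsilon handling, determinization, and minimization rather than this brute-force sketch.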
Citations: 12
Learning better lexical properties for recurrent OOV words
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707699
Longlu Qin, Alexander I. Rudnicky
Out-of-vocabulary (OOV) words can appear more than once in a conversation or over a period of time. Such multiple instances of the same OOV word provide valuable information for learning the lexical properties of the word. Therefore, we investigated how to estimate better pronunciations, spellings and part-of-speech (POS) labels for recurrent OOV words. We first identified recurrent OOV words from the output of a hybrid decoder by applying a bottom-up clustering approach. Then, multiple instances of the same OOV word were used simultaneously to learn properties of the OOV word. The experimental results showed that the bottom-up clustering approach is very effective at detecting the recurrence of OOV words. Furthermore, by using evidence from multiple instances of the same word, the pronunciation accuracy, recovery rate and POS label accuracy of recurrent OOV words can be substantially improved.
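A simplified stand-in for the bottom-up grouping step: merge hypothesized OOV spellings that fall within a small edit distance of an existing group. This greedy pass is only a sketch of the idea; the paper's clustering and the distance threshold here are not taken from it.

```python
def edit_distance(a, b):
    """Levenshtein distance between two hypothesized spellings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def group_oov_instances(spellings, max_dist=1):
    """Greedily merge spellings within max_dist edits of an existing group --
    a simplified stand-in for bottom-up clustering of recurrent OOV words."""
    groups = []
    for s in spellings:
        for g in groups:
            if any(edit_distance(s, m) <= max_dist for m in g):
                g.append(s)
                break
        else:
            groups.append([s])
    return groups
```

Once instances are grouped, their pooled acoustic and textual evidence can be used to refine the word's pronunciation, spelling, and POS label.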
Citations: 4
Compact acoustic modeling based on acoustic manifold using a mixture of factor analyzers
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707702
Wenlin Zhang, Bi-cheng Li, Weiqiang Zhang
A compact acoustic model for speech recognition is proposed based on nonlinear manifold modeling of the acoustic feature space. Acoustic features of the speech signal are assumed to form a low-dimensional manifold, which is modeled by a mixture of factor analyzers. Each factor analyzer describes a local area of the manifold using a low-dimensional linear model. For an HMM-based speech recognition system, observations of a particular state are constrained to be located on part of the manifold, which may cover several factor analyzers. For each tied-state, a sparse weight vector is obtained through an iteration shrinkage algorithm, in which the sparseness is determined automatically by the training data. For each nonzero component of the weight vector, a low-dimensional factor is estimated for the corresponding factor model according to the maximum a posteriori (MAP) criterion, resulting in a compact state model. Experimental results show that compared with the conventional HMM-GMM system and the SGMM system, the new method not only contains fewer parameters, but also yields better recognition results.
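The density evaluation behind a mixture of factor analyzers can be written out directly: component k is a Gaussian whose covariance is the low-rank-plus-diagonal form W_k W_k^T + diag(psi_k). This sketch shows only the log-density, not the paper's training or sparsification machinery.

```python
import numpy as np

def mfa_logpdf(x, weights, means, loadings, noise_vars):
    """Log-density of observation x under a mixture of factor analyzers.
    Component k is Gaussian with mean mu_k and covariance W_k W_k^T + diag(psi_k)."""
    log_terms = []
    for pi, mu, W, psi in zip(weights, means, loadings, noise_vars):
        S = W @ W.T + np.diag(psi)            # low-rank + diagonal covariance
        diff = x - mu
        _, logdet = np.linalg.slogdet(S)
        quad = diff @ np.linalg.solve(S, diff)
        log_terms.append(np.log(pi)
                         - 0.5 * (len(mu) * np.log(2 * np.pi) + logdet + quad))
    m = max(log_terms)                        # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(t - m) for t in log_terms))
```

The compactness comes from the factor loading matrices W_k being tall and thin: each component needs d*q + 2d parameters instead of the d*(d+1)/2 + d of a full-covariance Gaussian.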
Citations: 3
Discriminative semi-supervised training for keyword search in low resource languages
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707770
Roger Hsiao, Tim Ng, F. Grézl, D. Karakos, S. Tsakalidis, L. Nguyen, R. Schwartz
In this paper, we investigate semi-supervised training for low resource languages where the initial systems may have high error rate (≥ 70.0% word error rate). To handle the lack of data, we study semi-supervised techniques including data selection, data weighting, discriminative training and multilayer perceptron learning to improve system performance. The entire suite of semi-supervised methods presented in this paper was evaluated under the IARPA Babel program for the keyword spotting tasks. Our semi-supervised system had the best performance in the OpenKWS13 surprise language evaluation for the limited condition. In this paper, we describe our work on the Turkish and Vietnamese systems.
引用次数: 25
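The abstract above mentions data selection and data weighting over automatically transcribed audio. As a generic illustration of that idea (not the authors' exact recipe), the sketch below keeps utterances whose average word confidence clears a threshold and weights each surviving utterance by that confidence; the data layout, threshold value, and function name are all assumptions.

```python
# Hypothetical confidence-based selection and weighting of automatic
# transcripts for semi-supervised training. Threshold and weighting
# scheme are illustrative assumptions, not the paper's recipe.

def select_and_weight(utterances, conf_threshold=0.5):
    """Keep utterances whose average word confidence clears the
    threshold; weight each by that average so less reliable
    transcripts contribute less to training."""
    selected = []
    for utt in utterances:
        if not utt["word_confidences"]:
            continue
        avg_conf = sum(utt["word_confidences"]) / len(utt["word_confidences"])
        if avg_conf >= conf_threshold:
            selected.append({"id": utt["id"], "weight": avg_conf})
    return selected

utts = [
    {"id": "u1", "word_confidences": [0.9, 0.8, 0.95]},
    {"id": "u2", "word_confidences": [0.3, 0.2, 0.4]},  # dropped: avg 0.3
    {"id": "u3", "word_confidences": [0.7, 0.6]},
]
print(select_and_weight(utts))  # keeps u1 and u3, drops u2
```

In practice the weight would scale each utterance's contribution to the training objective; a hard threshold plus a soft weight is one common combination, but many variants exist.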
An empirical study of confusion modeling in keyword search for low resource languages
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707774
M. Saraçlar, A. Sethy, B. Ramabhadran, L. Mangu, Jia Cui, Xiaodong Cui, Brian Kingsbury, Jonathan Mamou
Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach in keyword search is to use Automatic Speech Recognition (ASR) as a front end to produce a representation of audio that can be indexed. The biggest drawback of this approach lies in its inability to deal with out-of-vocabulary words and query terms that are absent from the ASR system output. In this paper we present an empirical study evaluating various approaches that use confusion models as query expansion techniques to address this problem. We present results across four languages using a range of confusion models, which lead to significant improvements in keyword search performance as measured by the Maximum Term Weighted Value (MTWV) metric.
Citations: 42
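The abstract above reports gains measured by Maximum Term Weighted Value (MTWV). As a reference point, the sketch below computes MTWV as defined in NIST spoken-term-detection evaluations: TWV(θ) = 1 − avg over keywords of [P_miss(θ) + β·P_FA(θ)], maximized over the decision threshold θ, with β = 999.9 in the NIST cost model. The function name and the toy per-keyword error curves are illustrative assumptions, not numbers from the paper.

```python
# Sketch of the Maximum Term-Weighted Value (MTWV) metric from NIST
# spoken-term-detection evaluations. Error curves below are made up.

BETA = 999.9  # NIST STD cost ratio between false alarms and misses

def mtwv(pmiss_by_theta, pfa_by_theta):
    """pmiss_by_theta / pfa_by_theta map a threshold theta to a list of
    per-keyword P_miss / P_FA values. Returns (best_theta, MTWV)."""
    best_theta, best_twv = None, float("-inf")
    for theta in pmiss_by_theta:
        pmiss = pmiss_by_theta[theta]
        pfa = pfa_by_theta[theta]
        # term-weighted value averaged over keywords at this threshold
        twv = 1.0 - sum(pm + BETA * pf for pm, pf in zip(pmiss, pfa)) / len(pmiss)
        if twv > best_twv:
            best_theta, best_twv = theta, twv
    return best_theta, best_twv

best = mtwv(
    {0.3: [0.2, 0.3], 0.5: [0.4, 0.5]},          # P_miss per keyword
    {0.3: [0.0004, 0.0002], 0.5: [0.0001, 0.00005]},  # P_FA per keyword
)
print(best)
```

Because β is large, tiny changes in the false-alarm rate can dominate the score, which is why the looser threshold with fewer false alarms wins here despite its higher miss rate.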
2013 IEEE Workshop on Automatic Speech Recognition and Understanding