
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: latest publications

Language style and domain adaptation for cross-language SLU porting
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707720
Evgeny A. Stepanov, Ilya Kashkarev, Ali Orkan Bayer, G. Riccardi, Arindam Ghosh
Automatic cross-language Spoken Language Understanding (SLU) porting is plagued by two limitations. First, SLU models are usually trained on limited domain corpora. Second, language-pair resources (e.g. aligned corpora) are scarce or unmatched in style (e.g. news vs. conversation). We present experiments on automatic style adaptation of the input to the translation systems and of their output for SLU. We approach the problem of scarce aligned data by adapting the available parallel data to the target domain using limited in-domain and larger web-crawled close-to-domain corpora. SLU performance is optimized by reranking its output with a Recurrent Neural Network-based joint language model. We evaluate end-to-end SLU porting on close and distant language pairs, Spanish-Italian and Turkish-Italian, and achieve significant improvements in both translation quality and SLU performance.
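The reranking step described above can be illustrated with a minimal sketch: an n-best list of translated hypotheses is rescored by interpolating the first-pass (translation) score with a language-model score. The `toy_lm_score` function and the concept tag `DEPARTURE` are invented for illustration; the paper's actual joint RNN language model is far richer.

```python
def rerank_nbest(hypotheses, lm_score, lm_weight=0.5):
    """Rerank translated SLU hypotheses by interpolating the first-pass
    translation score with a language-model score (log-probabilities;
    higher is better)."""
    rescored = [
        (hyp, (1 - lm_weight) * tm_score + lm_weight * lm_score(hyp))
        for hyp, tm_score in hypotheses
    ]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Hypothetical stand-in for a joint LM: favors hypotheses that
# contain a known concept tag.
def toy_lm_score(hyp):
    return 0.0 if "DEPARTURE" in hyp else -2.0

nbest = [("show flights DEPARTURE rome", -1.5),
         ("show flights rome", -1.0)]
best, _ = rerank_nbest(nbest, toy_lm_score)[0]
```

With the weights above, the LM score outweighs the small translation-score deficit, so the concept-bearing hypothesis wins.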
Citations: 16
A propagation approach to modelling the joint distributions of clean and corrupted speech in the Mel-Cepstral domain
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707726
Ramón Fernández Astudillo
This paper presents a closed-form solution relating the joint distributions of corrupted and clean speech in the short-time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficient (MFCC) domains. This makes possible a tighter integration of STFT-domain speech enhancement with feature- and model-compensation techniques for robust automatic speech recognition. The approach directly utilizes the conventional speech distortion model for STFT speech enhancement, allowing for low-cost, single-pass, causal implementations. Compared to similar uncertainty propagation approaches, it provides the full joint distribution, rather than just the posterior distribution, which opens additional model-compensation possibilities. The method is exemplified by deriving an MMSE-MFCC estimator from the propagated joint distribution. It is shown that performance similar to that of STFT uncertainty propagation (STFT-UP) can be obtained on the AURORA4 corpus, while deriving the full joint distribution.
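The key property exploited here is that once the full joint distribution of clean and corrupted features is available, the MMSE estimate of the clean feature is simply the conditional mean. A scalar Gaussian sketch (not the paper's MFCC-domain derivation) makes the idea concrete:

```python
def mmse_estimate(y, mu_x, mu_y, cov_xy, var_y):
    """MMSE estimate of a clean feature x from a corrupted observation y,
    assuming (x, y) are jointly Gaussian: E[x | y] is the conditional
    mean  mu_x + (cov_xy / var_y) * (y - mu_y)."""
    return mu_x + (cov_xy / var_y) * (y - mu_y)

# Toy numbers: clean mean 0, corrupted mean 1, covariance 0.5, variance 1.
x_hat = mmse_estimate(3.0, mu_x=0.0, mu_y=1.0, cov_xy=0.5, var_y=1.0)
```

In the paper the same conditional-mean construction is carried out in the MFCC domain on the propagated joint distribution; the scalar case only shows the shape of the estimator.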
Citations: 2
Multi-stream temporally varying weight regression for cross-lingual speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707769
Shilin Liu, K. Sim
Building a good Automatic Speech Recognition (ASR) system with limited resources is a very challenging task due to the many sources of speech variation. Multilingual and cross-lingual speech recognition techniques are commonly used for this task. This paper investigates the recently proposed Temporally Varying Weight Regression (TVWR) method for cross-lingual speech recognition. TVWR uses posterior features to implicitly model the long-term temporal structures in acoustic patterns. By leveraging well-trained foreign recognizers, high-quality monophone/state posteriors can be easily incorporated into TVWR to boost ASR performance on low-resource languages. Furthermore, multi-stream TVWR is proposed, where multiple sets of posterior features are used to incorporate richer (temporal and spatial) context information. Finally, a separate state-tying for the TVWR regression parameters is used to better utilize the more reliable posterior features. Experiments are conducted on English and Malay speech recognition with limited resources. Using Czech, Hungarian and Russian posterior features, TVWR was found to consistently outperform tandem systems trained on the same features.
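A minimal sketch of the "temporally varying weight" idea, under the assumption (ours, not spelled out in the abstract) that per-frame mixture weights are obtained by a softmax over linear regressions on the frame's posterior feature vector:

```python
import math

def tvwr_weights(posterior, W):
    """Time-varying mixture weights for one frame: a softmax over
    linear regressions on the per-frame posterior feature vector.
    W has one regression row per mixture component."""
    scores = [sum(w_i * p_i for w_i, p_i in zip(row, posterior))
              for row in W]
    z = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two components, two-dimensional posterior feature.
w = tvwr_weights([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights are recomputed per frame from the posteriors, the effective mixture changes over time, which is the essence of the technique; the actual TVWR estimation procedure is more involved.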
Citations: 3
Cross-lingual context sharing and parameter-tying for multi-lingual speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707717
Aanchan Mohan, R. Rose
This paper is concerned with the problem of building acoustic models for automatic speech recognition (ASR) using speech data from multiple languages. Techniques for multi-lingual ASR are developed in the context of the subspace Gaussian mixture model (SGMM)[2, 3]. Multi-lingual SGMM based ASR systems have been configured with shared subspace parameters trained from multiple languages but with distinct language dependent phonetic contexts and states[11, 12]. First, an approach for sharing state-level target language and foreign language SGMM parameters is described. Second, semi-tied covariance transformations are applied as an alternative to full-covariance Gaussians to make acoustic model training less sensitive to issues of insufficient training data. These techniques are applied to Hindi and Marathi language data obtained for an agricultural commodities dialog task in multiple Indian languages.
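The subspace sharing at the heart of the SGMM can be sketched in one line of linear algebra: each state's mean is the product of a shared subspace projection (trainable from all languages) and a small state-specific vector. The toy matrices below are invented for illustration:

```python
def sgmm_state_mean(M, v):
    """SGMM-style state mean mu_j = M @ v_j, where the subspace
    projection M is shared (e.g. across languages) and the
    low-dimensional state vector v_j is state specific.
    Pure-Python matrix-vector product."""
    return [sum(m_ik * v_k for m_ik, v_k in zip(row, v)) for row in M]

# Shared 2x2 subspace and a state-specific 2-vector.
mu = sgmm_state_mean([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0])
```

Sharing `M` across languages while keeping the per-state `v_j` separate is what lets the multi-lingual data improve the target-language model without merging phone inventories.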
Citations: 2
Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707716
Yun-Nung (Vivian) Chen, William Yang Wang, Alexander I. Rudnicky
Spoken dialogue systems typically use predefined semantic slots to parse users' natural language inputs into unified semantic representations. To define the slots, domain experts and professional annotators are often involved, and the cost can be expensive. In this paper, we ask the following question: given a collection of unlabeled raw audio, can we use frame semantics theory to automatically induce and fill the semantic slots in an unsupervised fashion? To do this, we propose the use of a state-of-the-art frame-semantic parser, and a spectral clustering based slot ranking model that adapts the generic output of the parser to the target semantic space. Empirical experiments on a real-world spoken dialogue dataset show that the automatically induced semantic slots are in line with the reference slots created by domain experts: we observe a mean average precision of 69.36% using ASR-transcribed data. Our slot filling evaluations also indicate the promising future of this proposed approach.
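As a much simpler stand-in for the spectral-clustering slot ranking, candidate frames emitted by the parser can be ranked by corpus frequency; the frame names below are invented examples, not output of any particular parser:

```python
from collections import Counter

def rank_slot_candidates(parses):
    """Rank induced slot (frame) candidates by relative corpus
    frequency. A frequency baseline, not the paper's spectral
    clustering ranker."""
    counts = Counter(frame for utt in parses for frame in utt)
    total = sum(counts.values())
    return sorted(((f, c / total) for f, c in counts.items()),
                  key=lambda p: p[1], reverse=True)

# Hypothetical per-utterance frame outputs.
ranked = rank_slot_candidates([["food"],
                               ["food", "locale"],
                               ["expensiveness"]])
```

The paper's contribution is precisely that a coherence-aware ranking (spectral clustering over the frames' context) beats such naive counting when adapting generic frames to the target domain.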
Citations: 88
DNN acoustic modeling with modular multi-lingual feature extraction networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707754
Jonas Gehring, Quoc Bao Nguyen, Florian Metze, A. Waibel
In this work, we propose several deep neural network architectures that are able to leverage data from multiple languages. Modularity is achieved by training networks for extracting high-level features and for estimating phoneme state posteriors separately, and then combining them for decoding in a hybrid DNN/HMM setup. This approach has been shown to achieve superior performance for single-language systems, and here we demonstrate that feature extractors benefit significantly from being trained as multi-lingual networks with shared hidden representations. We also show that existing mono-lingual networks can be re-used in a modular fashion to achieve a similar level of performance without having to train new networks on multi-lingual data. Furthermore, we investigate extending these architectures to make use of language-specific acoustic features. Evaluations are performed on a low-resource conversational telephone speech transcription task in Vietnamese, while additional data for acoustic model training is provided in Pashto, Tagalog, Turkish, and Cantonese. Improvements of up to 17.4% and 13.8% over mono-lingual GMMs and DNNs, respectively, are obtained.
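The modular split can be sketched as two stacked toy layers: a shared multi-lingual feature extractor followed by a language-specific classification head. Single-layer components and the tiny weights are illustrative assumptions; the paper's networks are deep:

```python
import math

def linear(x, W, b):
    """Affine map: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def modular_forward(x, shared, head):
    """Shared feature extractor (tanh layer) followed by a
    language-specific softmax head producing phoneme-state
    posteriors. Each of `shared`/`head` is a (W, b) pair."""
    h = [math.tanh(v) for v in linear(x, *shared)]  # shared features
    logits = linear(h, *head)                       # per-language head
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

shared = ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0])
head = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
post = modular_forward([1.0, 2.0], shared, head)
```

Swapping in a different `head` while keeping `shared` fixed mirrors the paper's re-use of a multi-lingual extractor for a new target language.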
Citations: 12
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707765
Keith D. Levin, Katharine Henry, A. Jansen, Karen Livescu
Measures of acoustic similarity between words or other units are critical for segmental exemplar-based acoustic models, spoken term discovery, and query-by-example search. Dynamic time warping (DTW) alignment cost has been the most commonly used measure, but it has well-known inadequacies. Some recently proposed alternatives require large amounts of training data. In the interest of finding more efficient, accurate, and low-resource alternatives, we consider the problem of embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances (such as cosine or Euclidean) serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities. Such embeddings would enable efficient audio indexing and permit application of standard distance learning techniques to segmental acoustic modeling. In this paper, we explore several supervised and unsupervised approaches to this problem and evaluate them on an acoustic word discrimination task. We identify several embedding algorithms that match or improve upon the DTW baseline in low-resource settings.
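The DTW baseline and the simplest possible fixed-dimensional embedding (uniform downsampling to a fixed number of frames) can both be written in a few lines; scalar-valued frames are used here for brevity:

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic-time-warping alignment cost between two variable-length
    sequences, with the standard three-way recurrence."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

def downsample_embed(seq, k=4):
    """Naive fixed-dimensional embedding: sample k frames uniformly,
    so any two segments can be compared with a plain Euclidean or
    cosine distance regardless of their original lengths."""
    idx = [round(i * (len(seq) - 1) / (k - 1)) for i in range(k)]
    return [seq[i] for i in idx]
```

Downsampling is only a baseline-level embedding; the point of the paper is that learned embeddings in such fixed-dimensional spaces can match or beat the quadratic-time DTW comparison.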
Citations: 120
Semi-supervised training of Deep Neural Networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707741
Karel Veselý, M. Hannemann, L. Burget
In this paper we search for an optimal strategy for semi-supervised Deep Neural Network (DNN) training. We assume that a small part of the data is transcribed, while the majority of the data is untranscribed. We explore self-training strategies with data selection based on both utterance-level and frame-level confidences. We further study the interactions between semi-supervised frame-discriminative training and sequence-discriminative sMBR training. We found it beneficial to reduce the disproportion in the amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to do a frame selection based on per-frame confidences derived from confusion in a lattice. For the experiments, we used the Limited language pack condition for the Surprise language task (Vietnamese) from the IARPA Babel program. The absolute Word Error Rate (WER) improvement for frame cross-entropy training is 2.2%, which corresponds to a WER recovery of 36% compared to an identical system in which the DNN is built on the fully transcribed data.
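Two of the ingredients above, confidence-based utterance selection and replicating the transcribed data, can be sketched directly; the threshold and repeat count are hypothetical values, not the paper's tuned settings:

```python
def select_training_set(transcribed, untranscribed,
                        threshold=0.7, repeats=2):
    """Semi-supervised set construction: keep automatically transcribed
    utterances whose average per-frame confidence clears a threshold,
    and replicate the manually transcribed data to reduce the
    disproportion between the two pools."""
    selected = [utt for utt, confs in untranscribed
                if sum(confs) / len(confs) >= threshold]
    return transcribed * repeats + selected

# Hypothetical data: one transcribed utterance, two self-labeled ones
# with per-frame confidences.
data = select_training_set(["utt_a"],
                           [("utt_b", [0.9, 0.8]),
                            ("utt_c", [0.4, 0.5])])
```

Frame-level selection would additionally drop low-confidence frames inside the surviving utterances rather than whole utterances; the utterance-level version is shown for brevity.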
Citations: 137
Effective pseudo-relevance feedback for language modeling in speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707698
Berlin Chen, Yi-Wen Chen, Kuan-Yu Chen, E. Jan
Part and parcel of any automatic speech recognition (ASR) system is language modeling (LM), which helps to constrain the acoustic analysis, guide the search through multiple candidate word strings, and quantify the acceptability of the final output hypothesis given an input utterance. Despite the fact that the n-gram model remains the predominant one, a number of novel and ingenious LM methods have been developed to complement or be used in place of the n-gram model. A more recent line of research is to leverage information cues gleaned from pseudo-relevance feedback (PRF) to derive an utterance-regularized language model for complementing the n-gram model. This paper presents a continuation of this general line of research and its main contribution is two-fold. First, we explore an alternative and more efficient formulation to construct such an utterance-regularized language model for ASR. Second, the utilities of various utterance-regularized language models are analyzed and compared extensively. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task demonstrate that our proposed language models can offer substantial improvements over the baseline n-gram system, and achieve performance competitive to, or better than, some state-of-the-art language models.
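The PRF idea transfers from information retrieval roughly as follows: treat the first-pass n-best hypotheses as pseudo-relevant "documents", estimate a model from them, and interpolate it with the background LM for a second pass. A unigram sketch (the paper's formulation is more elaborate):

```python
from collections import Counter

def prf_unigram(nbest_hyps):
    """Unigram model estimated from first-pass (pseudo-relevant)
    hypotheses of the current utterance."""
    counts = Counter(w for hyp in nbest_hyps for w in hyp.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, background, prf, lam=0.3):
    """Interpolate the background model with the PRF model.
    `background` maps words to probabilities; the floor value for
    unseen words is an illustrative choice."""
    return (1 - lam) * background.get(word, 1e-6) + lam * prf.get(word, 0.0)

prf = prf_unigram(["a b", "a"])
p = interpolated_prob("a", {"a": 0.1}, prf, lam=0.3)
```

Because the PRF component is rebuilt per utterance, the interpolated model is "utterance-regularized": it sharpens probabilities toward words the first pass already found plausible.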
Citations: 0
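The pseudo-relevance-feedback idea in the abstract above — treat the top ASR hypotheses as pseudo-relevant documents, derive an utterance-specific model from them, and combine it with a background model for rescoring — can be illustrated with a toy sketch. All function names, the unigram choice, and the linear-interpolation scheme here are illustrative assumptions, not the paper's actual formulation:

```python
import math
from collections import Counter

def feedback_unigram(hypotheses, top_n=3):
    """Build a unigram 'feedback' model from the top-N ASR hypotheses,
    treating them as pseudo-relevant documents (toy illustration)."""
    words = [w for hyp in hypotheses[:top_n] for w in hyp.split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_logprob(sentence, background, feedback, lam=0.5, floor=1e-6):
    """Score a sentence under a linear interpolation of a background
    unigram model and the feedback model: P(w) = lam*Pb(w) + (1-lam)*Pf(w).
    'floor' stands in for proper smoothing of unseen words."""
    score = 0.0
    for w in sentence.split():
        p = lam * background.get(w, floor) + (1 - lam) * feedback.get(w, floor)
        score += math.log(p)
    return score

# Rerank competing hypotheses: the feedback model, built from the
# top-2 hypotheses, pulls the score toward words they agree on.
hyps = ["the cat sat", "the cat sat", "the bat sat"]
fb = feedback_unigram(hyps, top_n=2)
bg = {"the": 0.3, "cat": 0.2, "sat": 0.2, "bat": 0.1}
best = max(set(hyps), key=lambda h: interpolated_logprob(h, bg, fb))
```

A real system would use the n-gram background model and the regularized estimation the paper describes; this sketch only shows how feedback evidence reweights competing hypotheses.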
Semantic entity detection from multiple ASR hypotheses within the WFST framework 在WFST框架内从多个ASR假设中进行语义实体检测
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707710
J. Svec, P. Ircing, L. Smídl
The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFST) - ASR lattices are nowadays commonly represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can first be expressed in the form of a context-free grammar and then converted to FST form. We use WFST optimization to obtain a compact representation of the ASR lattice. The WFST framework also allows the use of word confusion networks as another representation of multiple ASR hypotheses. That way we can use the full power of the composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this approach allows us to overcome the need for a filler model and consequently makes the method more general. The paper includes an experimental evaluation of the proposed algorithm and compares the performance obtained by using the one-best word hypothesis, optimized lattices, and word confusion networks.
{"title":"Semantic entity detection from multiple ASR hypotheses within the WFST framework","authors":"J. Svec, P. Ircing, L. Smídl","doi":"10.1109/ASRU.2013.6707710","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707710","url":null,"abstract":"The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach the semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFST) - the ASR lattices are nowadays frequently represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can be first expressed in the form of a context free grammar and then converted to the FST form. We use a WFST optimization to obtain compact representation of the ASR lattice. The WFST framework also allows to use the word confusion networks as another representation of multiple ASR hypotheses. That way we can use the full power of composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this approach allows us to overcome the need for a filler model and consequently makes the method more general. 
The paper includes experimental evaluation of the proposed algorithm and compares the performance obtained by using the one-best word hypothesis, optimized lattices and word confusion networks.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124045880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
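The core operation in the abstract above — composing a word lattice (a weighted acceptor) with a grammar transducer so that only paths the grammar can interpret survive, with output labels carrying the semantic interpretation — can be sketched in miniature. This is a much-simplified stand-in for the OpenFST composition the paper uses: unweighted, breadth-first, and assuming an acyclic lattice; the state encoding and label names are illustrative assumptions:

```python
from collections import deque

def compose_paths(lattice, grammar, lat_final, gram_final):
    """Enumerate output label sequences of the composition of a word
    lattice with an entity 'grammar' transducer.

    Both machines are dicts: state -> [(in_label, out_label, next_state)].
    In the lattice, in_label == out_label (it is an acceptor over words).
    Assumes the lattice is acyclic, so the search terminates even if
    the grammar has cycles (e.g. a self-loop state)."""
    results = []
    queue = deque([((0, 0), [])])  # composed state is a (lattice, grammar) pair
    while queue:
        (s1, s2), out = queue.popleft()
        if s1 in lat_final and s2 in gram_final:
            results.append(" ".join(out))
        for w, _, n1 in lattice.get(s1, []):
            for i, o, n2 in grammar.get(s2, []):
                if i == w:  # labels must match for composition to advance
                    queue.append(((n1, n2), out + [o]))
    return results

# Lattice with two competing hypotheses: "flights to prague" / "flights two prague".
lattice = {0: [("flights", "flights", 1)],
           1: [("to", "to", 2), ("two", "two", 2)],
           2: [("prague", "prague", 3)]}
# Grammar transducer: rewrites "prague" as a city entity; has no arc for
# "two", so that hypothesis is filtered out by composition.
grammar = {0: [("flights", "flights", 0),
               ("to", "to", 0),
               ("prague", "<CITY>", 0)]}
interpretations = compose_paths(lattice, grammar, {3}, {0})
# → ['flights to <CITY>']
```

The real method additionally uses weights, lattice optimization, and a factor automaton so that entities can be detected without a filler model covering the out-of-grammar words; this sketch only shows how composition simultaneously filters hypotheses and attaches semantics.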
Journal
2013 IEEE Workshop on Automatic Speech Recognition and Understanding