
Latest publications: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Search results based N-best hypothesis rescoring with maximum entropy classification
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707767
Fuchun Peng, Scott Roy, B. Shahshahani, F. Beaufays
We propose a simple yet effective method for improving speech recognition by reranking the N-best speech recognition hypotheses using search results. We model N-best reranking as a binary classification problem and select the hypothesis with the highest classification confidence. We use query-specific features extracted from the search results to encode domain knowledge and use them with a maximum entropy classifier to rescore the N-best list. We show that by rescoring even just the top 2 hypotheses we obtain a significant 3% absolute sentence accuracy (SACC) improvement over a strong baseline on production traffic from an entertainment domain.
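The rescoring recipe above (binary classification over N-best entries, then picking the highest-confidence hypothesis) can be sketched with a toy maximum-entropy classifier, i.e. logistic regression. The single "search-result overlap" feature and all data below are hypothetical stand-ins for the paper's query-specific feature set, not the authors' production system.

```python
import math

def train_maxent(examples, lr=0.5, epochs=500):
    """Binary maximum-entropy (logistic regression) classifier trained
    with plain gradient descent; examples is a list of (features, label)."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def confidence(model, x):
    """P(hypothesis is correct | features) under the trained classifier."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def rerank(model, nbest):
    """Pick the N-best entry with the highest classification confidence.
    nbest: list of (hypothesis_text, features)."""
    return max(nbest, key=lambda entry: confidence(model, entry[1]))[0]
```

Trained on a handful of labeled (features, correct-or-not) pairs, `rerank` then prefers the hypothesis whose search-result features look most like those of correct recognitions.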
Citations: 20
The IBM keyword search system for the DARPA RATS program
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707730
L. Mangu, H. Soltau, H. Kuo, G. Saon
The paper describes a state-of-the-art keyword search (KWS) system in which significant improvements are obtained by using Convolutional Neural Network acoustic models, a two-step speech segmentation approach and a simplified ASR architecture optimized for KWS. The system described in this paper had the best performance in the 2013 DARPA RATS evaluation for both Levantine and Farsi.
Citations: 4
Learning filter banks within a deep neural network framework
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707746
Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, B. Ramabhadran
Mel-filter banks are commonly used in speech recognition, as they are motivated from theory related to speech production and perception. While features derived from mel-filter banks are quite popular, we argue that this filter bank is not really an appropriate choice as it is not learned for the objective at hand, i.e. speech recognition. In this paper, we explore replacing the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. Thus, the filter bank is learned to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter bank learning approach, compared to having a fixed set of filters.
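A minimal sketch of the filter-bank-layer idea: initialize a weight matrix to standard triangular mel filters and compute log filter energies as the layer's forward pass. In the paper this matrix would then be updated jointly with the rest of the network by backpropagating cross-entropy gradients; the training loop and any positivity handling are omitted here, so this shows only the initialization and forward pass.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def init_mel_filters(n_filters, n_bins, sample_rate):
    """Triangular mel filters over n_bins power-spectrum bins, used here
    as the *initialization* of the learnable filter-bank weight matrix."""
    f_max = sample_rate / 2.0
    mels = [i * hz_to_mel(f_max) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(round(mel_to_hz(m) / f_max * (n_bins - 1))) for m in mels]
    M = [[0.0] * n_bins for _ in range(n_filters)]
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, ctr):      # rising edge
            M[i][b] = (b - lo) / (ctr - lo)
        for b in range(ctr, hi):      # falling edge (peak = 1 at ctr)
            M[i][b] = (hi - b) / (hi - ctr)
    return M

def filterbank_layer(M, power_spectrum, eps=1e-10):
    """Forward pass of the layer: log filter energies. In joint training,
    M would be updated by backprop like any other network weight."""
    return [math.log(sum(m * p for m, p in zip(row, power_spectrum)) + eps)
            for row in M]
```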
Citations: 170
Learning a subword vocabulary based on unigram likelihood
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707697
Matti Varjokallio, M. Kurimo, Sami Virpioja
Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish. Thus, subword units are commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary, based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish LVCSR task. Unigram entropy of the text corpus is shown to be a good indicator of the quality of higher order n-gram models, also resulting in high speech recognition accuracy.
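The core quantity, the unigram likelihood of a corpus under a candidate subword vocabulary, can be sketched as follows. The greedy longest-match segmenter is an illustrative stand-in, not the authors' algorithm; a vocabulary learner would typically start from a large candidate set and prune units whose removal costs the least likelihood.

```python
import math
from collections import Counter

def greedy_segment(word, vocab):
    """Left-to-right longest-match segmentation; single characters are
    always allowed as a fallback so every word is segmentable."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                units.append(word[i:j])
                i = j
                break
    return units

def unigram_loglik(corpus, vocab):
    """Unigram log-likelihood of a corpus (list of words) under a subword
    vocabulary, with unit probabilities ML-estimated from the segmentation."""
    segmented = [greedy_segment(w, vocab) for w in corpus]
    counts = Counter(u for units in segmented for u in units)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())
```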
Citations: 19
Acoustic unit discovery and pronunciation generation from a grapheme-based lexicon
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707760
William Hartmann, A. Roy, L. Lamel, J. Gauvain
We present a framework for discovering acoustic units and generating an associated pronunciation lexicon from an initial grapheme-based recognition system. Our approach consists of two distinct contributions. First, context-dependent grapheme models are clustered using a spectral clustering approach to create a set of phone-like acoustic units. Next, we transform the pronunciation lexicon using a statistical machine translation-based approach. Pronunciation hypotheses generated from a decoding of the training set are used to create a phrase-based translation table. We propose a novel method for scoring the phrase-based rules that significantly improves the output of the transformation process. Results on an English language dataset demonstrate the combined methods provide a 13% relative reduction in word error rate compared to a baseline grapheme-based system. Our approach could potentially be applied to low-resource languages without existing lexicons, such as in the Babel project.
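The first step, clustering context-dependent grapheme models by spectral clustering, might look like the following generic two-way spectral partition over a similarity matrix of (here hypothetical) model indices: power iteration recovers the Fiedler vector of the graph Laplacian, and its sign splits the items. The paper's actual similarity measure and clustering objective are not reproduced here.

```python
import math
import random

def spectral_bipartition(S, iters=500, seed=0):
    """Split items into two clusters by the sign of the Fiedler vector
    (eigenvector of the second-smallest eigenvalue of L = D - S),
    found by power iteration on the shifted matrix c*I - L."""
    n = len(S)
    deg = [sum(row) for row in S]
    c = 2.0 * max(deg) + 1.0  # makes every eigenvalue of c*I - L positive
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        mean = sum(x) / n          # deflate the trivial constant eigenvector
        x = [xi - mean for xi in x]
        y = [c * x[i] - deg[i] * x[i] + sum(S[i][j] * x[j] for j in range(n))
             for i in range(n)]    # y = (c*I - D + S) x
        norm = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / norm for v in y]
    return [1 if v > 0 else 0 for v in x]
```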
Citations: 19
The second ‘CHiME’ speech separation and recognition challenge: An overview of challenge systems and outcomes
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707723
Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, Marco Matassoni
Distant-microphone automatic speech recognition (ASR) remains a challenging goal in everyday environments involving multiple background sources and reverberation. This paper reports on the results of the 2nd `CHiME' Challenge, an initiative designed to analyse and evaluate the performance of ASR systems in a real-world domestic environment. We discuss the rationale for the challenge and provide a summary of the datasets, tasks and baseline systems. The paper overviews the systems that were entered for the two challenge tracks: small-vocabulary with moving talker and medium-vocabulary with stationary talker. We present a summary of the challenge findings including novel results produced by challenge system combination. Possible directions for future challenges are discussed.
Citations: 94
Language style and domain adaptation for cross-language SLU porting
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707720
Evgeny A. Stepanov, Ilya Kashkarev, Ali Orkan Bayer, G. Riccardi, Arindam Ghosh
Automatic cross-language Spoken Language Understanding porting is plagued by two limitations. First, SLU systems are usually trained on limited-domain corpora. Second, language pair resources (e.g. aligned corpora) are scarce or unmatched in style (e.g. news vs. conversation). We present experiments on automatic style adaptation of the input for the translation systems and their output for SLU. We approach the problem of scarce aligned data by adapting the available parallel data to the target domain using limited in-domain and larger web-crawled close-to-domain corpora. SLU performance is optimized by reranking its output with a Recurrent Neural Network-based joint language model. We evaluate end-to-end SLU porting on close and distant language pairs, Spanish-Italian and Turkish-Italian, and achieve significant improvements both in translation quality and SLU performance.
Citations: 16
A propagation approach to modelling the joint distributions of clean and corrupted speech in the Mel-Cepstral domain
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707726
Ramón Fernández Astudillo
This paper presents a closed form solution relating the joint distributions of corrupted and clean speech in the short-time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficient (MFCC) domains. This makes possible a tighter integration of STFT domain speech enhancement and feature and model-compensation techniques for robust automatic speech recognition. The approach directly utilizes the conventional speech distortion model for STFT speech enhancement, allowing for low cost, single pass, causal implementations. Compared to similar uncertainty propagation approaches, it provides the full joint distribution, rather than just the posterior distribution, which provides additional model compensation possibilities. The method is exemplified by deriving an MMSE-MFCC estimator from the propagated joint distribution. It is shown that similar performance to that of STFT uncertainty propagation (STFT-UP) can be obtained on the AURORA4, while deriving the full joint distribution.
Citations: 2
Multi-stream temporally varying weight regression for cross-lingual speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707769
Shilin Liu, K. Sim
Building a good Automatic Speech Recognition (ASR) system with limited resources is a very challenging task due to the many sources of variation in speech. Multilingual and cross-lingual speech recognition techniques are commonly used for this task. This paper investigates the recently proposed Temporally Varying Weight Regression (TVWR) method for cross-lingual speech recognition. TVWR uses posterior features to implicitly model the long-term temporal structures in acoustic patterns. By leveraging well-trained foreign recognizers, high quality monophone/state posteriors can be easily incorporated into TVWR to boost the ASR performance on low-resource languages. Furthermore, multi-stream TVWR is proposed, where multiple sets of posterior features are used to incorporate richer (temporal and spatial) context information. Finally, a separate state-tying for the TVWR regression parameters is used to better utilize the more reliable posterior features. Experimental results are evaluated for English and Malay speech recognition with limited resources. By using the Czech, Hungarian and Russian posterior features, TVWR was found to consistently outperform the tandem systems trained on the same features.
Citations: 3
Cross-lingual context sharing and parameter-tying for multi-lingual speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707717
Aanchan Mohan, R. Rose
This paper is concerned with the problem of building acoustic models for automatic speech recognition (ASR) using speech data from multiple languages. Techniques for multi-lingual ASR are developed in the context of the subspace Gaussian mixture model (SGMM)[2, 3]. Multi-lingual SGMM based ASR systems have been configured with shared subspace parameters trained from multiple languages but with distinct language dependent phonetic contexts and states[11, 12]. First, an approach for sharing state-level target language and foreign language SGMM parameters is described. Second, semi-tied covariance transformations are applied as an alternative to full-covariance Gaussians to make acoustic model training less sensitive to issues of insufficient training data. These techniques are applied to Hindi and Marathi language data obtained for an agricultural commodities dialog task in multiple Indian languages.
Citations: 2