
Latest publications from the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding

Generalized likelihood ratio discriminant analysis
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373392
Hung-Shin Lee, Berlin Chen
In the past several decades, classifier-independent front-end feature extraction, where the derivation of acoustic features is only loosely tied to the back-end model training or classification, has been prominently used in various pattern recognition tasks, including automatic speech recognition (ASR). In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible, without requiring the homoscedastic assumption on class distributions. We also show that classical linear discriminant analysis (LDA) and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), can be regarded as two special cases of our proposed method. Empirical class confusion information can be further incorporated into GLRDA for better recognition performance. Experimental results demonstrate that GLRDA and its variant yield moderate performance improvements over HLDA and LDA on a large vocabulary continuous speech recognition (LVCSR) task.
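To make the special-case relationship concrete, here is a minimal sketch of the classical LDA projection, the homoscedastic special case the abstract refers to. The function name, array shapes, and use of SciPy are illustrative assumptions, not the authors' GLRDA implementation.

```python
# Classical LDA projection: the homoscedastic special case of GLRDA.
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, p):
    """X: (n_samples, dim) features; y: class labels; p: target dimension."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    Sb = np.zeros_like(Sw)                   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep the p leading directions.
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, np.argsort(vals)[::-1][:p]]  # (dim, p) projection matrix
```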
Citations: 9
Hidden Conditional Random Fields for phone recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373329
Yun-Hsuan Sung, Dan Jurafsky
We apply Hidden Conditional Random Fields (HCRFs) to the task of TIMIT phone recognition. HCRFs are discriminatively trained sequence models that augment conditional random fields with hidden states that are capable of representing subphones and mixture components. We extend HCRFs, which had previously only been applied to phone classification with known boundaries, to recognize continuous phone sequences. We use an N-best inference algorithm in both learning (to approximate all competitor phone sequences) and decoding (to marginalize over hidden states). Our monophone HCRFs achieve 28.3% phone error rate, outperforming maximum likelihood trained HMMs by 3.6%, maximum mutual information trained HMMs by 2.5%, and minimum phone error trained HMMs by 2.2%. We show that this win is partially due to HCRFs' ability to simultaneously optimize discriminative language models and acoustic models, a powerful property that has important implications for speech recognition.
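As a rough illustration of the N-best marginalization step described above, the sketch below scores each phone-sequence hypothesis by log-sum-exp over its hidden-state paths and normalizes over the N-best list. The data layout (a list of per-path log scores per hypothesis) is an assumption for illustration, not the paper's actual inference code.

```python
# N-best marginalization over hidden states, sketched with log-sum-exp.
import numpy as np

def hypothesis_log_score(path_log_scores):
    """Marginalize over the hidden-state paths of one phone sequence."""
    return np.logaddexp.reduce(np.asarray(path_log_scores))

def nbest_posteriors(nbest_path_scores):
    """Approximate sequence posteriors by normalizing over the N-best list."""
    log_scores = np.array([hypothesis_log_score(s) for s in nbest_path_scores])
    log_z = np.logaddexp.reduce(log_scores)  # partition over the N-best list only
    return np.exp(log_scores - log_z)
```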
Citations: 55
Extended Minimum Classification Error Training in Voice Activity Detection
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373251
T. Arakawa, Haitham Al-Hassanieh, M. Tsujikawa, R. Isotani
Voice Activity Detection (VAD) is a fundamental part of speech processing. Combining multiple acoustic features is an effective approach for making VAD more robust against various noise conditions. Several feature combination methods have been proposed in which the weights for feature values are optimized based on Minimum Classification Error (MCE) training. We improve these MCE-based methods by introducing a novel discriminative function for whole frames. The proposed method optimizes the combination weights while taking into account the ratio between false-acceptance and false-rejection rates, as well as the effect of shaping procedures such as hangover.
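The following is a minimal sketch of the conventional MCE recipe for learning feature-combination weights that the paper builds on: a frame is scored by a weighted feature sum, and a sigmoid-smoothed classification error is minimized by gradient descent. The per-frame loss shown here is the baseline formulation; the paper's whole-frame discriminative function and false-acceptance/false-rejection weighting are not reproduced.

```python
# Baseline MCE training of linear feature-combination weights for VAD.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mce_train(X, y, gamma=1.0, lr=0.1, epochs=50):
    """X: (n_frames, n_features) features; y: +1 speech / -1 non-speech labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        d = -y * (X @ w)                 # misclassification measure per frame
        s = sigmoid(gamma * d)           # smoothed 0-1 loss per frame
        # Gradient of mean(sigmoid(gamma * d)) with respect to w.
        grad = ((gamma * s * (1.0 - s) * -y)[:, None] * X).mean(axis=0)
        w -= lr * grad
    return w
```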
Citations: 2
MAP estimation of online mapping parameters in ensemble speaker and speaking environment modeling
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373236
Yu Tsao, Shigeki Matsuda, Satoshi Nakamura, Chin-Hui Lee
Recently, an ensemble speaker and speaking environment modeling (ESSEM) framework was proposed to enhance automatic speech recognition performance under adverse conditions. In the online phase of ESSEM, the environment structure prepared in the offline stage is transformed into a set of acoustic models for the target testing environment by using a mapping function. In the original ESSEM framework, the mapping function parameters are estimated based on a maximum likelihood (ML) criterion. In this study, we propose using a maximum a posteriori (MAP) criterion to calculate the mapping function, to avoid a possible over-fitting problem that can degrade the accuracy of environment characterization. We also study two types of prior densities for the MAP estimation, namely the clustered prior and the hierarchical prior. On the Aurora-2 task, with either type of prior density, MAP-based ESSEM achieves better performance than ML-based ESSEM, especially under low SNR conditions. Compared to our best baseline results, MAP-based ESSEM achieves an average word error rate reduction of 14.97% (5.41% to 4.60%) at signal-to-noise ratios of 0 dB to 20 dB over the three testing sets.
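Schematically, a MAP criterion guards against over-fitting by shrinking the ML estimate toward a prior. The sketch below shows the standard conjugate-Gaussian form of such an interpolation; the names (tau, prior_mean) and this specific form are illustrative assumptions rather than the paper's exact estimation formulas.

```python
# Schematic MAP smoothing of an ML estimate toward a prior (conjugate form).
def map_estimate(suff_stat_sum, frame_count, prior_mean, tau):
    """tau controls prior strength; larger tau means stronger pull to the prior."""
    ml_estimate = suff_stat_sum / frame_count
    return (tau * prior_mean + frame_count * ml_estimate) / (tau + frame_count)

# With little data (frame_count << tau) the estimate stays near the prior;
# with ample data it converges to the ML solution.
```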
Citations: 3
Short-time instantaneous frequency and bandwidth features for speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373305
P. Tsiakoulis, A. Potamianos, D. Dimitriadis
In this paper, we investigate the performance of modulation-related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude-weighted instantaneous frequencies and bandwidths, computed at each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain and, unlike MFCCs, do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency-related features perform at least as well as MFCCs under clean conditions and yield superior results under noisy conditions, with up to a 50% relative error rate reduction on the AURORA3 Spanish task.
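As a sketch of how such a feature can be computed for one subband, the code below derives the instantaneous frequency from the analytic signal and averages it weighted by the squared envelope. The filter design, windowing, and the companion bandwidth feature are omitted, so treat this as an assumption-laden illustration rather than the authors' front end.

```python
# Amplitude-weighted short-time instantaneous frequency for one subband.
import numpy as np
from scipy.signal import hilbert

def weighted_instantaneous_frequency(subband, fs):
    """subband: a band-passed frame of speech; fs: sample rate in Hz."""
    analytic = hilbert(subband)
    amp2 = np.abs(analytic) ** 2                   # squared envelope a^2(t)
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)  # f(t) in Hz
    # Short-time average of f(t), weighted by a^2(t).
    return np.sum(amp2[:-1] * inst_freq) / np.sum(amp2[:-1])
```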
Citations: 10
Generalization problem in ASR acoustic model training and adaptation
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373493
S. Furui
Since speech is highly variable, even if we have a fairly large-scale database, we cannot avoid the data sparseness problem in constructing automatic speech recognition (ASR) systems. How to train and adapt statistical models using limited amounts of data is one of the most important research issues in ASR. This paper summarizes the major techniques that have been proposed to solve the generalization problem in acoustic model training and adaptation, that is, how to achieve high recognition accuracy on new utterances. One common approach is controlling the degrees of freedom in model training and adaptation. The techniques can be classified according to whether they use a priori knowledge of speech obtained from a speech database, such as one covering many speakers. Another approach is maximizing the “margins” between training samples and the decision boundaries. Many of these techniques have also been combined and extended to further improve performance. Although many useful techniques have been developed, we still do not have a gold standard that can be applied to any kind of speech variation and any condition of the speech data available for training and adaptation.
Citations: 19
Power function-based power distribution normalization algorithm for robust speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373233
Chanwoo Kim, R. Stern
A novel algorithm that normalizes the distribution of spectral power coefficients is described in this paper. The algorithm, called power-function-based power distribution normalization (PPDN), is based on the observation that the ratio of the arithmetic mean to the geometric mean changes as speech is corrupted by noise, and a parametric power function is used to equalize this ratio. We also observe that a longer “medium-duration” observation window (of approximately 100 ms) is better suited for parameter estimation for noise compensation than the briefer window typically used for automatic speech recognition. We also describe the implementation of an online version of PPDN based on exponentially weighted temporal averaging. Experimental results show that PPDN provides comparable or slightly better results than state-of-the-art algorithms such as vector Taylor series for speech recognition while requiring much less computation. Hence, the algorithm is suitable both for real-time speech communication and as a real-time preprocessing stage for speech recognition systems.
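The core idea can be sketched as follows: measure the arithmetic-to-geometric-mean ratio of the power coefficients, then pick a power-function exponent that restores a target (clean-speech) ratio. The abstract says a parametric power function equalizes the ratio; the grid search below is just one simple way to realize that, and the target value is an illustrative assumption.

```python
# Power-function normalization driven by the arithmetic/geometric mean ratio.
import numpy as np

def am_gm_ratio(power):
    """Arithmetic mean over geometric mean of nonnegative power coefficients."""
    return power.mean() / np.exp(np.log(power + 1e-20).mean())

def ppdn_exponent(power, target_ratio, grid=np.linspace(0.1, 3.0, 300)):
    """Pick the exponent a such that AM/GM of power**a is closest to the target."""
    ratios = np.array([am_gm_ratio(power ** a) for a in grid])
    return grid[np.argmin(np.abs(ratios - target_ratio))]
```

Noise fills the valleys of the power distribution and lowers the AM/GM ratio, so the selected exponent typically exceeds 1, sharpening the distribution back toward the clean-speech statistics.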
Citations: 30
A multiplatform speech recognition decoder based on weighted finite-state transducers
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373404
Emilian Stoimenov, Tanja Schultz
Speech recognition decoders based on static graphs have recently proven to significantly outperform the traditional approach of prefix-tree expansion in terms of decoding speed [1], [2]. The reduced search effort makes static graph decoders an attractive alternative for tasks with limited processing power or memory footprint on devices such as PDAs, internet tablets, and smart phones. In this paper we explore the benefits of decoding with an optimized speech recognition network over the fully task-optimized prefix-tree based decoder IBIS [3]. We designed and implemented a new WFST-based decoder called SWIFT (Speedy WeIghted Finite-state Transducer) with its application to embedded platforms in mind. After describing the design and the network construction and storage process, we present evaluation results on a small task suitable for embedded applications and on a large task, namely the European Parliament Plenary Sessions (EPPS) task from the TC-STAR project [20]. The SWIFT decoder is up to 50% faster than IBIS on both tasks. In addition, SWIFT achieves significant reductions in memory consumption through our network-specific storage layout optimization.
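For intuition about decoding over a static, precomposed network, here is a toy token-passing step over a weighted arc list. The arc tuple layout and the acoustic_score callback are assumptions for illustration; a real WFST decoder such as SWIFT additionally handles epsilon arcs, beam pruning, and traceback.

```python
# One Viterbi frame update over a static recognition network.
import math
from collections import defaultdict

def viterbi_step(tokens, arcs, acoustic_score):
    """tokens: {state: log score}; arcs: [(src, dst, input_label, weight)]."""
    new_tokens = defaultdict(lambda: -math.inf)
    for src, dst, label, weight in arcs:
        if src in tokens:
            score = tokens[src] + weight + acoustic_score(label)
            if score > new_tokens[dst]:  # keep only the best token per state
                new_tokens[dst] = score
    return new_tokens
```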
Citations: 3
Toward machine translation with statistics and syntax and semantics
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373509
Dekai Wu
In this paper, we survey some central issues in the historical, current, and future landscape of statistical machine translation (SMT) research, taking as a starting point an extended three-dimensional MT model space. We posit a socio-geographical conceptual disparity hypothesis that aims to explain why language pairs like Chinese-English have presented MT with so much more difficulty than others. The evolution from simple token-based to segment-based to tree-based syntactic SMT is sketched. For tree-based SMT, we consider language bias rationales for selecting the degree of compositional power within the hierarchy of expressiveness for transduction grammars (or synchronous grammars). This leads us to inversion transductions and the ITG model prevalent in current state-of-the-art SMT, along with the underlying ITG hypothesis, which posits a language universal. Against this backdrop, we enumerate a set of key open questions for syntactic SMT. We then consider the more recent area of semantic SMT. We list principles for the successful application of sense disambiguation models to semantic SMT and describe early directions in the use of semantic role labeling for semantic SMT.
Citations: 7
Temporal envelope subtraction for robust speech recognition using modulation spectrum
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372922
Sriram Ganapathy, Samuel Thomas, H. Hermansky
In this paper, we present a new noise compensation technique for modulation frequency features derived from syllable-length segments of subband temporal envelopes. The subband temporal envelopes are estimated using frequency domain linear prediction (FDLP). We propose a technique for noise compensation in FDLP in which an estimate of the noise envelope is subtracted from the noisy speech envelope. The noise-compensated FDLP envelopes are compressed with static (logarithmic) and dynamic (adaptive loops) compression and are transformed into modulation spectral features. Experiments are performed on a phoneme recognition task as well as a connected digit recognition task, where the test data is corrupted with a variety of noise types at different signal-to-noise ratios. In these experiments with mismatched training and test conditions, the proposed features provide considerable improvements over other state-of-the-art noise-robust feature extraction techniques (average relative improvements of 25% and 35% over the baseline PLP features for the phoneme and word recognition tasks, respectively).
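A schematic sketch of the envelope-subtraction step is given below: a noise-envelope estimate, here taken as the mean subband envelope over frames assumed to be non-speech, is subtracted from the noisy envelope with flooring, in the spirit of spectral subtraction. The FDLP envelope extraction itself is not shown, and the flooring constant and noise-frame selection are illustrative assumptions.

```python
# Noise-envelope subtraction for one subband temporal envelope.
import numpy as np

def subtract_noise_envelope(noisy_env, noise_frames, floor=0.01):
    """noisy_env: (n_frames,) subband temporal envelope; noise_frames: indices
    of frames assumed to contain no speech, used to estimate the noise level."""
    noise_level = noisy_env[noise_frames].mean()
    cleaned = noisy_env - noise_level
    # Floor the result to keep the envelope positive, as spectral subtraction does.
    return np.maximum(cleaned, floor * noisy_env)
```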
Citations: 7