Generalized likelihood ratio discriminant analysis
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373392
Hung-Shin Lee, Berlin Chen
In the past several decades, classifier-independent front-end feature extraction, in which the derivation of acoustic features is only loosely associated with back-end model training or classification, has been prominently used in various pattern recognition tasks, including automatic speech recognition (ASR). In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test (LRT). It seeks a lower-dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely as possible, without requiring the homoscedastic assumption on the class distributions. We also show that classical linear discriminant analysis (LDA) and its well-known extension, heteroscedastic linear discriminant analysis (HLDA), can be regarded as two special cases of our proposed method. Empirical class confusion information can be further incorporated into GLRDA for better recognition performance. Experimental results demonstrate that GLRDA and its variant can yield moderate performance improvements over HLDA and LDA on a large vocabulary continuous speech recognition (LVCSR) task.
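The abstract does not state the objective explicitly; as a hedged sketch of the underlying idea (notation mine, not necessarily the paper's), a likelihood-ratio-based discriminant criterion can be written as follows.

```latex
% Hedged sketch; notation is illustrative, not the paper's exact formulation.
% Generalized likelihood ratio for a null hypothesis H0 vs. the unrestricted model:
\[
  \lambda(\mathbf{X}) \;=\;
  \frac{\displaystyle \sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{X})}
       {\displaystyle \sup_{\theta \in \Theta}   L(\theta \mid \mathbf{X})},
  \qquad 0 \le \lambda(\mathbf{X}) \le 1 .
\]
% A GLRDA-style criterion would choose the projection W that makes the "most
% confusing" null hypothesis as implausible as possible in the projected space:
\[
  W^{\ast} \;=\; \arg\min_{W}\; \lambda\!\left(W^{\top}\mathbf{X}\right),
\]
% without tying the class distributions to a shared covariance; under a
% homoscedastic Gaussian assumption such a criterion reduces to an LDA-like
% objective, and with class-specific covariances to an HLDA-like one.
```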
{"title":"Generalized likelihood ratio discriminant analysis","authors":"Hung-Shin Lee, Berlin Chen","doi":"10.1109/ASRU.2009.5373392","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373392","url":null,"abstract":"In the past several decades, classifier-independent front-end feature extraction, where the derivation of acoustic features is lightly associated with the back-end model training or classification, has been prominently used in various pattern recognition tasks, including automatic speech recognition (ASR). In this paper, we present a novel discriminative feature transformation, named generalized likelihood ratio discriminant analysis (GLRDA), on the basis of the likelihood ratio test (LRT). It attempts to seek a lower dimensional feature subspace by making the most confusing situation, described by the null hypothesis, as unlikely to happen as possible without the homoscedastic assumption on class distributions. We also show that the classical linear discriminant analysis (LDA) and its well-known extension - heteroscedastic linear discriminant analysis (HLDA) can be regarded as two special cases of our proposed method. The empirical class confusion information can be further incorporated into GLRDA for better recognition performance. Experimental results demonstrate that GLRDA and its variant can yield moderate performance improvements over HLDA and LDA for the large vocabulary continuous speech recognition (LVCSR) task.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121331409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hidden Conditional Random Fields for phone recognition
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373329
Yun-Hsuan Sung, Dan Jurafsky
We apply Hidden Conditional Random Fields (HCRFs) to the task of TIMIT phone recognition. HCRFs are discriminatively trained sequence models that augment conditional random fields with hidden states that are capable of representing subphones and mixture components. We extend HCRFs, which had previously only been applied to phone classification with known boundaries, to recognize continuous phone sequences. We use an N-best inference algorithm in both learning (to approximate all competitor phone sequences) and decoding (to marginalize over hidden states). Our monophone HCRFs achieve 28.3% phone error rate, outperforming maximum likelihood trained HMMs by 3.6%, maximum mutual information trained HMMs by 2.5%, and minimum phone error trained HMMs by 2.2%. We show that this win is partially due to HCRFs' ability to simultaneously optimize discriminative language models and acoustic models, a powerful property that has important implications for speech recognition.
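For context, the standard hidden-CRF formulation looks as follows; this is the generic definition, and the paper's exact feature set and parameterization may differ.

```latex
% Generic HCRF conditional model (hedged sketch, standard notation):
\[
  p(y \mid x; \Lambda) \;=\;
  \frac{\sum_{h} \exp\!\big(\Lambda \cdot f(y, h, x)\big)}
       {\sum_{y'} \sum_{h'} \exp\!\big(\Lambda \cdot f(y', h', x)\big)}
\]
% y: phone sequence; h: hidden state sequence (e.g. subphone states and mixture
% components); f: feature functions; \Lambda: weights. Training maximizes the
% conditional log-likelihood; per the abstract, the sum over competitor phone
% sequences y' in the denominator is approximated with an N-best list, and
% decoding marginalizes over the hidden states h.
```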
{"title":"Hidden Conditional Random Fields for phone recognition","authors":"Yun-Hsuan Sung, Dan Jurafsky","doi":"10.1109/ASRU.2009.5373329","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373329","url":null,"abstract":"We apply Hidden Conditional Random Fields (HCRFs) to the task of TIMIT phone recognition. HCRFs are discriminatively trained sequence models that augment conditional random fields with hidden states that are capable of representing subphones and mixture components. We extend HCRFs, which had previously only been applied to phone classification with known boundaries, to recognize continuous phone sequences. We use an N-best inference algorithm in both learning (to approximate all competitor phone sequences) and decoding (to marginalize over hidden states). Our monophone HCRFs achieve 28.3% phone error rate, outperforming maximum likelihood trained HMMs by 3.6%, maximum mutual information trained HMMs by 2.5%, and minimum phone error trained HMMs by 2.2%. We show that this win is partially due to HCRFs' ability to simultaneously optimize discriminative language models and acoustic models, a powerful property that has important implications for speech recognition.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"116 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126394035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extended Minimum Classification Error Training in Voice Activity Detection
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373251
T. Arakawa, Haitham Al-Hassanieh, M. Tsujikawa, R. Isotani
Voice Activity Detection (VAD) is a fundamental part of speech processing. Combining multiple acoustic features is an effective approach to making VAD more robust under various noise conditions. Several feature combination methods have been proposed in which the weights for the feature values are optimized based on Minimum Classification Error (MCE) training. We improve these MCE-based methods by introducing a novel discriminative function defined over whole frames. The proposed method optimizes the combination weights while taking into account the ratio between the false acceptance and false rejection rates, as well as the effect of shaping procedures such as hangover.
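A minimal sketch of the generic idea, assuming a simple linear feature combination and a sigmoid-smoothed frame classification error; the paper's actual discriminative function, frame weighting, and hangover handling are not reproduced here, and all names below are illustrative.

```python
import numpy as np

def smoothed_error(w, feats, labels, alpha=5.0):
    """feats: (n_frames, n_features); labels: +1 for speech, -1 for non-speech."""
    margin = labels * (feats @ w)                      # positive margin = correct side
    z = np.clip(alpha * margin, -60.0, 60.0)           # avoid overflow in exp
    return np.mean(1.0 / (1.0 + np.exp(z)))            # sigmoid-smoothed 0/1 loss

def train_weights(feats, labels, lr=0.5, epochs=200, alpha=5.0):
    """Plain gradient descent on the smoothed frame classification error."""
    w = np.zeros(feats.shape[1])
    for _ in range(epochs):
        margin = labels * (feats @ w)
        s = 1.0 / (1.0 + np.exp(np.clip(alpha * margin, -60.0, 60.0)))
        grad = -alpha * np.mean((s * (1.0 - s) * labels)[:, None] * feats, axis=0)
        w -= lr * grad
    return w
```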
{"title":"Extended Minimum Classification Error Training in Voice Activity Detection","authors":"T. Arakawa, Haitham Al-Hassanieh, M. Tsujikawa, R. Isotani","doi":"10.1109/ASRU.2009.5373251","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373251","url":null,"abstract":"Voice Activity Detection (VAD) is a fundamental part of speech processing. Combination of multiple acoustic features is an effective approach to make VAD more robust against various noise conditions. There have been proposed several feature combination methods, in which weights for feature values are optimized based on Minimum Classification Error (MCE) training. We improve these MCE-based methods by introducing a novel discriminative function for whole frames. The proposed method optimizes combination weights taking into account the ratio between false acceptance and false rejection rates as well as the effect of the use of shaping procedures such as hangover.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126764174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAP estimation of online mapping parameters in ensemble speaker and speaking environment modeling
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373236
Yu Tsao, Shigeki Matsuda, Satoshi Nakamura, Chin-Hui Lee
Recently, an ensemble speaker and speaking environment modeling (ESSEM) framework was proposed to enhance automatic speech recognition performance under adverse conditions. In the online phase of ESSEM, the environment structure prepared in the offline stage is transformed into a set of acoustic models for the target testing environment by a mapping function. In the original ESSEM framework, the mapping function parameters are estimated based on a maximum likelihood (ML) criterion. In this study, we propose using a maximum a posteriori (MAP) criterion to estimate the mapping function, thereby avoiding a possible over-fitting problem that can degrade the accuracy of environment characterization. For the MAP estimation, we also study two types of prior densities, namely a clustered prior and a hierarchical prior. On the Aurora-2 task, MAP-based ESSEM achieves better performance than ML-based ESSEM with either type of prior density, especially under low-SNR conditions. Compared with our best baseline results, MAP-based ESSEM achieves an average word error rate reduction of 14.97% (from 5.41% to 4.60%) at signal-to-noise ratios of 0 dB to 20 dB over the three testing sets.
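The contrast between the two criteria can be sketched generically as below; this is a hedged outline in my own notation, since the paper's mapping function and prior densities are more specific than shown here.

```latex
% Generic ML vs. MAP criteria for the mapping-function parameters \lambda:
\[
  \hat{\lambda}_{\mathrm{ML}}  \;=\; \arg\max_{\lambda}\; p\big(\mathbf{O} \mid \Lambda(\lambda)\big),
  \qquad
  \hat{\lambda}_{\mathrm{MAP}} \;=\; \arg\max_{\lambda}\; p\big(\mathbf{O} \mid \Lambda(\lambda)\big)\, p(\lambda),
\]
% where O is the adaptation data from the test environment and \Lambda(\lambda)
% denotes the target-environment acoustic models obtained by applying the
% mapping function to the offline environment structure. The prior p(\lambda)
% (e.g. the clustered or hierarchical prior studied in the paper) regularizes
% the estimate when O is limited, which is where the ML estimate tends to over-fit.
```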
{"title":"MAP estimation of online mapping parameters in ensemble speaker and speaking environment modeling","authors":"Yu Tsao, Shigeki Matsuda, Satoshi Nakamura, Chin-Hui Lee","doi":"10.1109/ASRU.2009.5373236","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373236","url":null,"abstract":"Recently, an ensemble speaker and speaking environment modeling (ESSEM) framework was proposed to enhance automatic speech recognition performance under adverse conditions. In the online phase of ESSEM, the prepared environment structure in the offline stage is transformed to a set of acoustic models for the target testing environment by using a mapping function. In the original ESSEM framework, the mapping function parameters are estimated based on a maximum likelihood (ML) criterion. In this study, we propose to use a maximum a posteriori (MAP) criterion to calculate the mapping function to avoid a possible over-fitting problem that can degrade the accuracy of environment characterization. For the MAP estimation, we also study two types of prior densities, namely, clustered prior and hierarchical prior, in this paper. On the Aurora-2 task using either type of prior densities, MAP-based ESSEM can achieve better performance than ML-based ESSEM, especially under low SNR conditions. When comparing to our best baseline results, the MAP-based ESSEM achieves a 14.97% (5.41% to 4.60%) word error rate reduction in average at a signal to noise ratio of 0dB to 20dB over the three testing sets.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126188774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Short-time instantaneous frequency and bandwidth features for speech recognition
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373305
P. Tsiakoulis, A. Potamianos, D. Dimitriadis
In this paper, we investigate the performance of modulation-related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude-weighted instantaneous frequencies and bandwidths, computed in each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain and, unlike MFCCs, do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency-related features perform at least as well as MFCCs under clean conditions and yield superior results under noisy conditions, with up to 50% relative error rate reduction on the AURORA3 Spanish task.
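A rough sketch of an amplitude-weighted short-time instantaneous frequency feature for a single subband, using the analytic signal; the paper's filterbank design, weighting, and bandwidth features may differ, and the function names and defaults below are illustrative.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def subband_weighted_if(x, fs, f_lo, f_hi, frame_len=400, hop=160):
    """Per-frame mean instantaneous frequency (Hz) of one subband,
    weighted by squared instantaneous amplitude.
    Defaults correspond to 25 ms frames / 10 ms hop at 16 kHz."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)                       # subband signal
    analytic = hilbert(band)                         # analytic signal
    amp = np.abs(analytic)
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.diff(phase) * fs / (2.0 * np.pi)     # instantaneous frequency, Hz
    amp2 = amp[1:] ** 2                              # align with inst_f
    feats = []
    for start in range(0, len(inst_f) - frame_len + 1, hop):
        a = amp2[start:start + frame_len]
        f = inst_f[start:start + frame_len]
        feats.append(np.sum(a * f) / (np.sum(a) + 1e-12))
    return np.array(feats)
```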
{"title":"Short-time instantaneous frequency and bandwidth features for speech recognition","authors":"P. Tsiakoulis, A. Potamianos, D. Dimitriadis","doi":"10.1109/ASRU.2009.5373305","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373305","url":null,"abstract":"In this paper, we investigate the performance of modulation related features and normalized spectral moments for automatic speech recognition. We focus on the short-time averages of the amplitude weighted instantaneous frequencies and bandwidths, computed at each subband of a mel-spaced filterbank. Similar features have been proposed in previous studies, and have been successfully combined with MFCCs for speech and speaker recognition. Our goal is to investigate the stand-alone performance of these features. First, it is experimentally shown that the proposed features are only moderately correlated in the frequency domain, and, unlike MFCCs, they do not require a transformation to the cepstral domain. Next, the filterbank parameters (number of filters and filter overlap) are investigated for the proposed features and compared with those of MFCCs. Results show that frequency related features perform at least as well as MFCCs for clean conditions, and yield superior results for noisy conditions; up to 50% relative error rate reduction for the AURORA3 Spanish task.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126465551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalization problem in ASR acoustic model training and adaptation
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373493
S. Furui
Since speech is highly variable, even with a fairly large-scale database we cannot avoid the data sparseness problem in constructing automatic speech recognition (ASR) systems. How to train and adapt statistical models using limited amounts of data is one of the most important research issues in ASR. This paper summarizes major techniques that have been proposed to solve the generalization problem in acoustic model training and adaptation, that is, how to achieve high recognition accuracy for new utterances. One common approach is controlling the degrees of freedom in model training and adaptation. These techniques can be classified according to whether or not they use a priori knowledge of speech obtained from a speech database, such as one recorded by many speakers. Another approach is maximizing the "margins" between training samples and the decision boundaries. Many of these techniques have also been combined and extended to further improve performance. Although many useful techniques have been developed, we still do not have a gold standard that can be applied to any kind of speech variation and any condition of the speech data available for training and adaptation.
{"title":"Generalization problem in ASR acoustic model training and adaptation","authors":"S. Furui","doi":"10.1109/ASRU.2009.5373493","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373493","url":null,"abstract":"Since speech is highly variable, even if we have a fairly large-scale database, we cannot avoid the data sparseness problem in constructing automatic speech recognition (ASR) systems. How to train and adapt statistical models using limited amounts of data is one of the most important research issues in ASR. This paper summarizes major techniques that have been proposed to solve the generalization problem in acoustic model training and adaptation, that is, how to achieve high recognition accuracy for new utterances. One of the common approaches is controlling the degree of freedom in model training and adaptation. The techniques can be classified by whether a priori knowledge of speech obtained by a speech database such as those spoken by many speakers is used or not. Another approach is maximizing “margins” between training samples and the decision boundaries. Many of these techniques have also been combined and extended to further improve performance. Although many useful techniques have been developed, we still do not have a golden standard that can be applied to any kind of speech variation and any condition of the speech data available for training and adaptation.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131323370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power function-based power distribution normalization algorithm for robust speech recognition
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373233
Chanwoo Kim, R. Stern
A novel algorithm that normalizes the distribution of spectral power coefficients is described in this paper. The algorithm, called power-function-based power distribution normalization (PPDN), is based on the observation that the ratio of the arithmetic mean to the geometric mean of spectral power changes as speech is corrupted by noise, and a parametric power function is used to equalize this ratio. We also observe that a longer "medium-duration" observation window (of approximately 100 ms) is better suited for estimating noise-compensation parameters than the briefer window typically used for automatic speech recognition. We also describe an online implementation of PPDN based on exponentially weighted temporal averaging. Experimental results show that PPDN provides comparable or slightly better recognition results than state-of-the-art algorithms such as vector Taylor series while requiring much less computation. Hence, the algorithm is suitable both for real-time speech communication and as a real-time preprocessing stage for speech recognition systems.
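A minimal sketch of the arithmetic-to-geometric-mean idea, assuming a simple search over candidate power exponents; the actual PPDN algorithm (medium-duration windowing, online smoothing, choice of target ratio) is more involved, and the names and defaults below are illustrative.

```python
import numpy as np

def am_gm_ratio(power):
    """Arithmetic-to-geometric-mean ratio of non-negative power values."""
    power = np.maximum(power, 1e-20)
    return np.mean(power) / np.exp(np.mean(np.log(power)))

def normalize_power(power, target_ratio, exponents=np.linspace(0.2, 3.0, 57)):
    """Pick the power-function exponent whose output AM/GM ratio is closest to
    the target (e.g. a ratio estimated from clean training speech)."""
    best = min(exponents, key=lambda a: abs(am_gm_ratio(power ** a) - target_ratio))
    return power ** best, best
```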
{"title":"Power function-based power distribution normalization algorithm for robust speech recognition","authors":"Chanwoo Kim, R. Stern","doi":"10.1109/ASRU.2009.5373233","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373233","url":null,"abstract":"A novel algorithm that normalizes the distribution of spectral power coefficients is described in this paper. The algorithm, called power-function-based power distribution (PPDN) is based on the observation that the ratio of arithmetic mean to geometric mean changes as speech is corrupted by noise, and a parametric power function is used to equalize this ratio. We also observe that a longer “medium-duration” observation window (of approximately 100 ms) is better suited for parameter estimation for noise compensation than the briefer window typically used for automatic speech recognition. We also describe the implementation of an online version of PPDN based on exponentially weighted temporal averaging. Experimental results shows that PPDN provides comparable or slightly better results than state of- the-art algorithms such as vector Taylor series for speech recognition while requiring much less computation. Hence, the algorithm is suitable for both real-time speech communication or as a real-time preprocessing stage for speech recognition systems.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127982058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multiplatform speech recognition decoder based on weighted finite-state transducers
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373404
Emilian Stoimenov, Tanja Schultz
Speech recognition decoders based on static graphs have recently been shown to significantly outperform the traditional approach of prefix tree expansion in terms of decoding speed [1], [2]. The reduced search effort makes static graph decoders an attractive alternative for tasks constrained by limited processing power or memory footprint on devices such as PDAs, internet tablets, and smart phones. In this paper we explore the benefits of decoding with an optimized speech recognition network over the fully task-optimized prefix-tree based decoder IBIS [3]. We designed and implemented a new WFST-based decoder called SWIFT (Speedy WeIghted Finite-state Transducer) with embedded platforms in mind. After describing the design and the network construction and storage process, we present evaluation results on a small task suitable for embedded applications and on a large task, namely the European Parliament Plenary Sessions (EPPS) task from the TC-STAR project [20]. The SWIFT decoder is up to 50% faster than IBIS on both tasks. In addition, SWIFT achieves significant reductions in memory consumption through our network-specific storage layout optimization.
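For reference, the usual static-network construction for WFST-based decoding is sketched below; this is the standard recipe, and SWIFT's own construction and storage-layout optimizations may differ from it.

```latex
% Standard static recognition network for WFST decoding (hedged sketch):
\[
  N \;=\; \operatorname{min}\!\Big(\operatorname{det}\big(H \circ C \circ L \circ G\big)\Big),
\]
% where G is the language-model acceptor, L the pronunciation lexicon,
% C the context-dependency transducer, and H the HMM topology; det and min
% denote determinization and minimization. Decoding is then a time-synchronous
% beam search over the single precompiled graph N, which is what makes the
% static-graph approach fast but memory-sensitive on embedded devices.
```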
{"title":"A multiplatform speech recognition decoder based on weighted finite-state transducers","authors":"Emilian Stoimenov, Tanja Schultz","doi":"10.1109/ASRU.2009.5373404","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373404","url":null,"abstract":"Speech recognition decoders based on static graphs have recently proven to significantly outperform the traditional approach of prefix tree expansion in terms of decoding speed [1], [2]. The reduced search effort makes static graph decoders an attractive alternative for tasks concerned with limited processing power or memory footprint on devices such as PDAs, internet tablets, and smart phones. In this paper we explore the benefits of decoding with an optimized speech recognition network over the fully task-optimized prefix-tree based decoder IBIS [3]. We designed and implemented a new decoder called SWIFT (Speedy WeIgthed Finite-state Transducer) based on WFSTs with its application to embedded platforms in mind. After describing the design, the network construction and storage process, we present evaluation results on a small task suitable for embedded applications, and on a large task, namely the European Parliament Plenary Sessions (EPPS) task from the TC-STAR project [20]. The SWIFT Decoder is up to 50% faster than IBIS on both tasks. In addition, SWIFT achieves significant memory consumption reductions obtained by our innovative network specific storage layout optimization.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115534633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward machine translation with statistics and syntax and semantics
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5373509
Dekai Wu
In this paper, we survey some central issues in the historical, current, and future landscape of statistical machine translation (SMT) research, taking as a starting point an extended three-dimensional MT model space. We posit a socio-geographical conceptual disparity hypothesis that aims to explain why language pairs like Chinese-English have presented MT with so much more difficulty than others. The evolution from simple token-based to segment-based to tree-based syntactic SMT is sketched. For tree-based SMT, we consider language bias rationales for selecting the degree of compositional power within the hierarchy of expressiveness for transduction grammars (or synchronous grammars). This leads us to inversion transductions and the ITG model prevalent in current state-of-the-art SMT, along with the underlying ITG hypothesis, which posits a language universal. Against this backdrop, we enumerate a set of key open questions for syntactic SMT. We then consider the more recent area of semantic SMT. We list principles for successful application of sense disambiguation models to semantic SMT, and describe early directions in the use of semantic role labeling for semantic SMT.
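As background on the ITG model mentioned above, the productions of an inversion transduction grammar in the usual notation are sketched below; this is the textbook form, and the hierarchy of transduction grammars discussed in the paper is richer than this.

```latex
% Inversion transduction grammar productions (hedged sketch, standard notation):
\[
  A \rightarrow [\,B\;C\,], \qquad
  A \rightarrow \langle B\;C \rangle, \qquad
  A \rightarrow e/f, \qquad
  A \rightarrow e/\epsilon, \qquad
  A \rightarrow \epsilon/f
\]
% [B C] expands the two constituents in the same order in both languages, while
% <B C> inverts their order in the output language; the ITG hypothesis is that
% this restricted, binary-rank reordering suffices to cover the reorderings
% needed between natural-language pairs.
```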
{"title":"Toward machine translation with statistics and syntax and semantics","authors":"Dekai Wu","doi":"10.1109/ASRU.2009.5373509","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373509","url":null,"abstract":"In this paper, we survey some central issues in the historical, current, and future landscape of statistical machine translation (SMT) research, taking as a starting point an extended three-dimensional MT model space. We posit a socio-geographical conceptual disparity hypothesis, that aims to explain why language pairs like Chinese-English have presented MT with so much more difficulty than others. The evolution from simple token-based to segment-based to tree-based syntactic SMT is sketched. For tree-based SMT, we consider language bias rationales for selecting the degree of compositional power within the hierarchy of expressiveness for transduction grammars (or synchronous grammars). This leads us to inversion transductions and the ITG model prevalent in current state-of-the-art SMT, along with the underlying ITG hypothesis, which posits a language universal. Against this backdrop, we enumerate a set of key open questions for syntactic SMT. We then consider the more recent area of semantic SMT. We list principles for successful application of sense disambiguation models to semantic SMT, and describe early directions in the use of semantic role labeling for semantic SMT.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123964265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal envelope subtraction for robust speech recognition using modulation spectrum
Pub Date: 2009-12-01, DOI: 10.1109/ASRU.2009.5372922
Sriram Ganapathy, Samuel Thomas, H. Hermansky
In this paper, we present a new noise compensation technique for modulation frequency features derived from syllable-length segments of subband temporal envelopes. The subband temporal envelopes are estimated using frequency domain linear prediction (FDLP). We propose a technique for noise compensation in FDLP in which an estimate of the noise envelope is subtracted from the noisy speech envelope. The noise-compensated FDLP envelopes are compressed with static (logarithmic) and dynamic (adaptive loops) compression and are transformed into modulation spectral features. Experiments are performed on a phoneme recognition task as well as a connected digit recognition task where the test data is corrupted with a variety of noise types at different signal-to-noise ratios. In these experiments with mismatched train and test conditions, the proposed features provide considerable improvements over other state-of-the-art noise-robust feature extraction techniques (average relative improvements of 25% and 35% over the baseline PLP features for the phoneme and word recognition tasks, respectively).
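A minimal sketch of the envelope-subtraction step, analogous to spectral subtraction but applied to subband temporal envelopes; the FDLP envelope estimation, the noise-envelope estimator actually used in the paper, and the compression stages are not reproduced here, and the names and defaults below are illustrative.

```python
import numpy as np

def estimate_noise_envelope(noisy_env, nonspeech_mask):
    """Constant noise-envelope estimate from frames marked as non-speech
    (the mask is assumed to be given, e.g. by a VAD)."""
    return np.full_like(noisy_env, noisy_env[nonspeech_mask].mean())

def subtract_noise_envelope(noisy_env, noise_env, floor_frac=0.02):
    """Subtract the noise envelope from the noisy subband temporal envelope,
    flooring the result to a small fraction of the noisy envelope to avoid
    negative or zero values before compression."""
    clean_est = noisy_env - noise_env
    floor = floor_frac * noisy_env
    return np.maximum(clean_est, floor)
```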
{"title":"Temporal envelope subtraction for robust speech recognition using modulation spectrum","authors":"Sriram Ganapathy, Samuel Thomas, H. Hermansky","doi":"10.1109/ASRU.2009.5372922","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372922","url":null,"abstract":"In this paper, we present a new noise compensation technique for modulation frequency features derived from syllable length segments of subband temporal envelopes. The subband temporal envelopes are estimated using frequency domain linear prediction (FDLP). We propose a technique for noise compensation in FDLP where an estimate of the noise envelope is subtracted from the noisy speech envelope. The noise compensated FDLP envelopes are compressed with static (logarithmic) and dynamic (adaptive loops) compression and are transformed into modulation spectral features. Experiments are performed on a phoneme recognition task as well as a connected digit recognition task where the test data is corrupted with variety of noise types at different signal to noise ratios. In these experiments with mismatched train and test conditions, the proposed features provide considerable improvements compared to other state of the art noise robust feature extraction techniques (average relative improvement of 25 % and 35 % over the baseline PLP features for phoneme and word recognition tasks respectively).","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121140307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}