SNR features for automatic speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372895
Philip N. Garner
When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal to Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectral-based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR-based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further present anecdotal evidence that SNR-based features lend themselves well to noise estimates based on low-energy envelope tracking.
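The ML and MAP estimators are derived in the paper itself; purely as an illustrative sketch of the underlying idea (the function name, smoothing floor, and log(1 + SNR) mapping are assumptions here, not the paper's derivation), an SNR-based filterbank feature could be computed as:

```python
import numpy as np

def snr_features(power_spec, noise_psd, floor=1e-10):
    """Toy SNR-based features: log(1 + SNR) per mel filterbank bin.

    power_spec: (frames, bins) mel power spectrogram
    noise_psd:  (bins,) noise power estimate, e.g. obtained by
                low-energy envelope tracking over recent frames.
    Illustrative stand-in only for the paper's ML/MAP estimators.
    """
    snr = power_spec / np.maximum(noise_psd, floor)
    return np.log1p(snr)  # log(1 + SNR) stays finite as SNR -> 0

# Cepstra would follow by applying a DCT, as with conventional MFCCs.
```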
{"title":"SNR features for automatic speech recognition","authors":"Philip N. Garner","doi":"10.1109/ASRU.2009.5372895","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372895","url":null,"abstract":"When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal to Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectral based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further show anecdotal evidence that SNR based features lend themselves well to noise estimates based on low-energy envelope tracking.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116224084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Any questions? Automatic question detection in meetings
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373293
K. Boakye, Benoit Favre, Dilek Z. Hakkani-Tür
In this paper, we describe our efforts toward the automatic detection of English questions in meetings. We analyze the utility of various features for this task, originating from three distinct classes: lexico-syntactic, turn-related, and pitch-related. Of particular interest is the use of parse tree information in classification, an approach as yet unexplored. Results from experiments on the ICSI MRDA corpus demonstrate that lexico-syntactic features are most useful for this task, with turn- and pitch-related features providing complementary information in combination. In addition, experiments using reference parse trees on the broadcast conversation portion of the OntoNotes release 2.9 data set illustrate the potential of parse trees to outperform word lexical features.
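As a hedged sketch of how the three feature classes might feed a single classifier (the utterance fields and feature names below are hypothetical, and the paper's lexico-syntactic set also includes parse-tree features not shown here):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def utterance_features(utt):
    """Hypothetical examples drawn from the paper's three classes:
    lexico-syntactic, turn-related, and pitch-related."""
    return {
        "first_word=" + utt["words"][0].lower(): 1.0,  # lexico-syntactic
        "n_words": float(len(utt["words"])),           # lexico-syntactic
        "turn_initial": float(utt["turn_initial"]),    # turn-related
        "final_f0_slope": utt["f0_slope"],             # pitch-related
    }

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# clf.fit([utterance_features(u) for u in train_utts],
#         [u["is_question"] for u in train_utts])
```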
{"title":"Any questions? Automatic question detection in meetings","authors":"K. Boakye, Benoit Favre, Dilek Z. Hakkani-Tür","doi":"10.1109/ASRU.2009.5373293","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373293","url":null,"abstract":"In this paper, we describe our efforts toward the automatic detection of English questions in meetings. We analyze the utility of various features for this task, originating from three distinct classes: lexico-syntactic, turn-related, and pitch-related. Of particular interest is the use of parse tree information in classification, an approach as yet unexplored. Results from experiments on the ICSI MRDA corpus demonstrate that lexico-syntactic features are most useful for this task, with turn-and pitch-related features providing complementary information in combination. In addition, experiments using reference parse trees on the broadcast conversation portion of the OntoNotes release 2.9 data set illustrate the potential of parse trees to outperform word lexical features.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124805628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query-by-example spoken term detection using phonetic posteriorgram templates
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372889
Timothy J. Hazen, Wade Shen, Christopher M. White
This paper examines a query-by-example approach to spoken term detection in audio files. The approach is designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable. Instead of using word or phone strings as search terms, the user presents the system with audio snippets of desired search terms to act as the queries. Query and test materials are represented using phonetic posteriorgrams obtained from a phonetic recognition system. Query matches in the test data are located using a modified dynamic time warping search between query templates and test utterances. Experiments using this approach are presented using data from the Fisher corpus.
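For intuition, a plain DTW match between posteriorgrams might look like the sketch below; -log(q . t) is a common local distance for posterior features, but the paper's modified segmental search differs in detail, and the function name and length normalisation here are assumptions:

```python
import numpy as np

def posteriorgram_dtw(query, test, eps=1e-12):
    """DTW cost between a query posteriorgram (nq x phones) and a
    test posteriorgram (nt x phones). The start and end points are
    left free on the test side so the query can match anywhere."""
    d = -np.log(np.maximum(query @ test.T, eps))  # local distances
    nq, nt = d.shape
    acc = np.full((nq, nt), np.inf)
    acc[0, :] = d[0, :]                     # free start in the test
    for i in range(1, nq):
        acc[i, 0] = acc[i - 1, 0] + d[i, 0]
        for j in range(1, nt):
            acc[i, j] = d[i, j] + min(acc[i - 1, j],
                                      acc[i, j - 1],
                                      acc[i - 1, j - 1])
    return acc[-1].min() / nq               # best end, length-normalised
```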
{"title":"Query-by-example spoken term detection using phonetic posteriorgram templates","authors":"Timothy J. Hazen, Wade Shen, Christopher M. White","doi":"10.1109/ASRU.2009.5372889","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372889","url":null,"abstract":"This paper examines a query-by-example approach to spoken term detection in audio files. The approach is designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable. Instead of using word or phone strings as search terms, the user presents the system with audio snippets of desired search terms to act as the queries. Query and test materials are represented using phonetic posteriorgrams obtained from a phonetic recognition system. Query matches in the test data are located using a modified dynamic time warping search between query templates and test utterances. Experiments using this approach are presented using data from the Fisher corpus.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126969833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving online incremental speaker adaptation with eigen feature space MLLR
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373227
Xiaodong Cui, Jian Xue, Bowen Zhou
This paper investigates an eigen feature space maximum likelihood linear regression (fMLLR) scheme to improve the performance of online speaker adaptation in automatic speech recognition systems. In this stochastic-approximation-like framework, the traditional incremental fMLLR estimate is treated as a slowly changing mean of the eigen fMLLR. This helps adaptation when only a limited amount of data is available at the beginning of the conversation. The scheme is shown to balance the transform estimation against the amount of available data, and yields reasonable improvements for online systems.
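One plausible reading of the scheme, with notation assumed rather than taken from the paper: the fMLLR transform is confined to a low-dimensional eigen space centred on the slowly changing incremental estimate,

```latex
\hat{x}_t = A x_t + b, \qquad W = [A \;\; b], \qquad
W = \bar{W}_t + \sum_{k=1}^{K} \alpha_k W_k
% \bar{W}_t: slowly updated incremental fMLLR estimate;
% W_k: eigen basis transforms; only the K weights \alpha_k are
% re-estimated per utterance, stabilising adaptation on little data.
```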
{"title":"Improving online incremental speaker adaptation with eigen feature space MLLR","authors":"Xiaodong Cui, Jian Xue, Bowen Zhou","doi":"10.1109/ASRU.2009.5373227","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373227","url":null,"abstract":"This paper investigates an eigen feature space maximum likelihood linear regression (fMLLR) scheme to improve the performance of online speaker adaptation in automatic speech recognition systems. In this stochastic-approximation-like framework, the traditional incremental fMLLR estimation is considered as a slowly changing mean of the eigen fMLLR. It helps the adaptation when only a limited amount of data is available at the beginning of the conversation. The scheme is shown to be able to balance the transformation estimation given the data and yields reasonable improvements for online systems.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121669364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vietnamese large vocabulary continuous speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373424
Ngoc Thang Vu, Tanja Schultz
We report on our recent efforts toward a large vocabulary Vietnamese speech recognition system. In particular, we describe the Vietnamese text and speech database recently collected as part of our GlobalPhone corpus. The data was complemented by a large collection of text data crawled from various Vietnamese websites. To bootstrap the Vietnamese speech recognition system, we used our Rapid Language Adaptation scheme with a multilingual phone inventory. After initialization, we investigated the peculiarities of the Vietnamese language and achieved significant improvements by implementing different tone modeling schemes extended by pitch extraction, handling multiwords to address the monosyllabic structure of Vietnamese, and using language models based on 5-grams. Furthermore, we addressed dialectal variation between South and North Vietnam by creating dialect-dependent pronunciations and including dialect in the context decision tree of the recognizer. Our best recognition system currently achieves a word error rate of 11.7% on read newspaper speech.
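The abstract does not state how multiwords were selected; one common recipe, sketched below with invented thresholds, merges frequent adjacent syllables into single language-model tokens:

```python
from collections import Counter

def merge_multiwords(corpus, min_count=1000, max_merges=10000):
    """Toy multiword induction for a monosyllabic language: join
    frequent adjacent syllable pairs into one LM token, e.g.
    "viet nam" -> "viet_nam". Frequency is only one plausible
    selection criterion; the paper does not spell out its own.
    corpus: list of sentences, each a list of syllable tokens."""
    pairs = Counter()
    for sent in corpus:
        pairs.update(zip(sent, sent[1:]))
    merges = {p for p, c in pairs.most_common(max_merges) if c >= min_count}
    merged = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```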
{"title":"Vietnamese large vocabulary continuous speech recognition","authors":"Ngoc Thang Vu, Tanja Schultz","doi":"10.1109/ASRU.2009.5373424","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373424","url":null,"abstract":"We report on our recent efforts toward a large vocabulary Vietnamese speech recognition system. In particular, we describe the Vietnamese text and speech database recently collected as part of our GlobalPhone corpus. The data was complemented by a large collection of text data crawled from various Vietnamese websites. To bootstrap the Vietnamese speech recognition system we used our Rapid Language Adaptation scheme applying a multilingual phone inventory. After initialization we investigated the peculiarities of the Vietnamese language and achieved significant improvements by implementing different tone modeling schemes, extended by pitch extraction, handling multiwords to address the monosyllable structure of Vietnamese, and featuring language modeling based on 5-grams. Furthermore, we addressed the issue of dialectal variations between South and North Vietnam by creating dialect dependent pronunciations and including dialect in the context decision tree of the recognizer. Our currently best recognition system achieves a word error rate of 11.7% on read newspaper speech.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115952057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparing automatic rich transcription for Portuguese, Spanish and English Broadcast News
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373371
Fernando Batista, I. Trancoso, N. Mamede
This paper describes and evaluates a language-independent approach for automatically enriching speech recognition output with punctuation marks and capitalization information. The two tasks are treated as classification problems, using a maximum entropy modeling approach, which achieves state-of-the-art results. The language independence of the approach is demonstrated by experiments conducted on Portuguese, Spanish and English Broadcast News corpora. This paper provides the first comparative study of these tasks across the three languages.
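Both tasks use the standard maximum entropy (multinomial logistic) model; the example features named in the comment are assumptions, since the paper defines its own feature set:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)
% y: punctuation mark or capitalisation class at a word position;
% f_i: binary context features, e.g. surrounding word identities
% or pause durations (illustrative examples only).
```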
{"title":"Comparing automatic rich transcription for Portuguese, Spanish and English Broadcast News","authors":"Fernando Batista, I. Trancoso, N. Mamede","doi":"10.1109/ASRU.2009.5373371","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373371","url":null,"abstract":"This paper describes and evaluates a language independent approach for automatically enriching the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach, which achieves results within state-of-the-art. The language independence of the approach is attested with experiments conducted on Portuguese, Spanish and English Broadcast News corpora. This paper provides the first comparative study between the three languages, concerning these tasks.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125911344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A study on Hidden Structural Model and its application to labeling sequences
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373239
Y. Qiao, Masayuki Suzuki, N. Minematsu
This paper proposes the Hidden Structural Model (HSM) for statistical modeling of sequence data. The HSM generalizes our previous proposal on structural representation by introducing hidden states and probabilistic models. Compared with the previous structural representation, the HSM not only solves the problem of misalignment of events, but also allows structure-based decoding, which makes it applicable to general speech recognition tasks. Unlike an HMM, the HSM accounts for the probability of both locally absolute and globally contrastive features. This paper focuses on the fundamental formulation and theory of the HSM. We also develop methods for state inference, probability calculation and parameter estimation. In particular, we show that state inference in the HSM can be reduced to a quadratic programming problem. We carry out two experiments to examine the performance of the HSM on labeling sequences. The first tests the HSM on artificially transformed sequences, and the second uses a Japanese corpus of connected vowel utterances. The experimental results demonstrate the effectiveness of the HSM.
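The reduction itself is given in the paper rather than the abstract; for orientation only, the target form of the inference problem is the standard quadratic program over a decision variable z:

```latex
\min_{z} \; \tfrac{1}{2}\, z^{\top} Q z + c^{\top} z
\quad \text{subject to} \quad A z \le b, \;\; z \ge 0
```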
{"title":"A study on Hidden Structural Model and its application to labeling sequences","authors":"Y. Qiao, Masayuki Suzuki, N. Minematsu","doi":"10.1109/ASRU.2009.5373239","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373239","url":null,"abstract":"This paper proposes Hidden Structure Model (HSM) for statistical modeling of sequence data. The HSM generalizes our previous proposal on structural representation by introducing hidden states and probabilistic models. Compared with the previous structural representation, HSM not only can solve the problem of misalignment of events, but also can conduct structure-based decoding, which allows us to apply HSM to general speech recognition tasks. Different from HMM, HSM accounts for the probability of both locally absolute and globally contrastive features. This paper focuses on the fundamental formulation and theories of HSM. We also develop methods for the problems of state inference, probability calculation and parameter estimation of HSM. Especially, we show that the state inference of HSM can be reduced to a quadratic programming problem. We carry out two experiments to examine the performance of HSM on labeling sequences. The first experiment tests HSM by using artificially transformed sequences, and the second experiment is based on a Japanese corpus of connected vowel utterances. The experimental results demonstrate the effectiveness of HSM.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131522837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372931
Yaodong Zhang, James R. Glass
In this paper, we present an unsupervised learning framework to address the problem of detecting spoken keywords. Without any transcription information, a Gaussian Mixture Model is trained to label speech frames with a Gaussian posteriorgram. Given one or more spoken examples of a keyword, we use segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. We examine the TIMIT corpus as a development set to tune the parameters in our system, and the MIT Lecture corpus for more substantial evaluation. The results demonstrate the viability and effectiveness of our unsupervised learning framework on the keyword spotting task.
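Complementing the DTW sketch given for the Hazen et al. paper above, the unsupervised frame labelling might look as follows; the component count, feature dimensionality and smoothing constant are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit an unsupervised GMM on feature frames pooled from
# untranscribed audio; random data stands in for real frames here.
train_frames = np.random.randn(1000, 13)   # stand-in for pooled MFCCs
gmm = GaussianMixture(n_components=50, covariance_type="diag")
gmm.fit(train_frames)

def gaussian_posteriorgram(frames, smooth=1e-5):
    """Represent each frame by its posterior over GMM components."""
    post = gmm.predict_proba(frames)        # (n_frames, n_components)
    post = post + smooth                    # avoid exact zeros for -log
    return post / post.sum(axis=1, keepdims=True)
```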
{"title":"Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams","authors":"Yaodong Zhang, James R. Glass","doi":"10.1109/ASRU.2009.5372931","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372931","url":null,"abstract":"In this paper, we present an unsupervised learning framework to address the problem of detecting spoken keywords. Without any transcription information, a Gaussian Mixture Model is trained to label speech frames with a Gaussian posteriorgram. Given one or more spoken examples of a keyword, we use segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. We examine the TIMIT corpus as a development set to tune the parameters in our system, and the MIT Lecture corpus for more substantial evaluation. The results demonstrate the viability and effectiveness of our unsupervised learning framework on the keyword spotting task.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131686081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An improved perceptual speech enhancement technique employing a psychoacoustically motivated weighting factor
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372883
Md. Jahangir Alam, S. Selouani, D. O'Shaughnessy
Suppression of speech components after perceptual speech enhancement (SE) lowers the noise masking threshold (NMT) level of the enhanced signal. This may re-introduce noise components that are initially masked but not processed by the denoising filter, thereby favoring the emergence of musical noise. This paper presents a modified perceptual speech enhancement algorithm based on a perceptually motivated weighting factor that effectively suppresses the background noise without introducing much distortion into the enhanced signal. The performance of the proposed algorithm is evaluated with the Segmental SNR and Perceptual Evaluation of Speech Quality (PESQ) measures under various noisy environments, and it yields better results than conventional perceptual speech enhancement methods.
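As a generic illustration only (not the paper's exact weighting factor): a suppression rule can subtract more noise where the residual would rise above the masking threshold, and less where it is already masked, limiting musical noise without over-attenuating speech.

```python
import numpy as np

def perceptual_gain(noisy_psd, noise_psd, nmt, alpha=1.0, floor=0.05):
    """Toy perceptually weighted spectral-subtraction gain.
    noisy_psd, noise_psd, nmt: per-bin power arrays for one frame;
    nmt is the noise masking threshold. Illustrative only."""
    # weight > 1 where residual noise would exceed the NMT
    w = np.maximum(noise_psd / np.maximum(nmt, 1e-12), 1.0)
    clean_est = np.maximum(noisy_psd - alpha * w * noise_psd,
                           floor * noisy_psd)   # spectral floor
    return np.sqrt(clean_est / noisy_psd)       # amplitude-domain gain
```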
{"title":"An improved perceptual speech enhancement technique employing a psychoacoustically motivated weighting factor","authors":"Md. Jahangir Alam, S. Selouani, D. O'Shaughnessy","doi":"10.1109/ASRU.2009.5372883","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372883","url":null,"abstract":"Suppression of speech components after perceptual speech enhancement (SE) lowers the noise masking threshold (NMT) level of the enhanced signal. This may re-introduce noise components that are initially masked but not processed by the denoising filter, thereby, favoring the emergence of musical noise. This paper presents a modified perceptual speech enhancement algorithm based on a perceptually motivated weighting factor to effectively suppress the background noise without introducing much distortion in the enhanced signal using the perceptual speech enhancement methods. The performance of the proposed enhancement algorithm is evaluated by the Segmental SNR and Perceptual Evaluation of Speech Quality (PESQ) measures under various noisy environments and yields better results compared to the perceptual speech enhancement methods.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128768910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving joint uncertainty decoding performance by predictive methods for noise robust speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373317
H. Xu, M. Gales, K. K. Chin
Model-based noise compensation techniques, such as Vector Taylor Series (VTS) compensation, have been applied to a range of noise robustness tasks. However, one issue with these approaches is that they are computationally expensive for large speech recognition systems. To address this problem, schemes such as Joint Uncertainty Decoding (JUD) have been proposed. Though JUD is computationally more efficient, performance typically degrades. This paper proposes an alternative scheme, VTS-JUD, which is related to JUD but makes fewer approximations. Unfortunately, this approach also removes some of the computational advantages of JUD. To address this, VTS-JUD is not used directly but instead provides the statistics for estimating a predictive linear transform, PCMLLR. This is both computationally efficient and limits some of the issues associated with the diagonal covariance matrices typically used with schemes such as VTS. PCMLLR can also be used straightforwardly within an adaptive training framework (PAT). The VTS-JUD, PCMLLR and PAT systems were compared to a number of standard approaches on an in-car speech recognition task. The proposed scheme is an attractive alternative to existing approaches.
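For background, the standard VTS mismatch function relates clean speech x, additive noise n and convolutional channel h to the noisy observation y in the cepstral domain (C is the DCT matrix; log and exp apply element-wise); the compensation schemes discussed here build on Taylor linearisations of this function:

```latex
y = x + h + C \log\!\Big( 1 + \exp\big( C^{-1} (n - x - h) \big) \Big)
```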
{"title":"Improving joint uncertainty decoding performance by predictive methods for noise robust speech recognition","authors":"H. Xu, M. Gales, K. K. Chin","doi":"10.1109/ASRU.2009.5373317","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373317","url":null,"abstract":"Model-based noise compensation techniques, such as Vector Taylor Series (VTS) compensation, have been applied to a range of noise robustness tasks. However one of the issues with these forms of approach is that for large speech recognition systems they are computationally expensive. To address this problem schemes such as Joint Uncertainty Decoding (JUD) have been proposed. Though computationally more efficient, the performance of the system is typically degraded. This paper proposes an alternative scheme, related to JUD, but making fewer approximations, VTS-JUD. Unfortunately this approach also removes some of the computational advantages of JUD. To address this, rather than using VTS-JUD directly, it is used instead to obtain statistics to estimate a predictive linear transform, PCMLLR. This is both computationally efficient and limits some of the issues associated with the diagonal covariance matrices typically used with schemes such as VTS. PCMLLR can also be simply used within an adaptive training framework (PAT). The performance of the VTS-JUD, PCMLLR and PAT system were compared to a number of standard approaches on an in-car speech recognition task. The proposed scheme is an attractive alternative to existing approaches.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116752319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}