Speech recognition of broadcast news for the European Portuguese language
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034651
H. Meinedo, N. Souto, J. Neto
This paper describes our work on the development of a large vocabulary continuous speech recognition system applied to a broadcast news task for the European Portuguese language within the scope of the ALERT project. We start by presenting the baseline recogniser AUDIMUS, which was originally developed on a corpus of read newspaper text. This is a hybrid system that combines phone probabilities generated by several MLPs trained on distinct feature sets. The paper details the modifications introduced in this system, namely the development of a new language model, vocabulary and pronunciation lexicon, and training on the new data currently available from the ALERT BN corpus. The system trained on this BN corpus achieved 18.4% WER when tested on the F0 focus condition (studio, planned, native, clean) and 35.2% when tested on all focus conditions.
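A minimal sketch of the hybrid combination step the abstract describes: per-frame phone posteriors from several MLPs, each trained on a different feature set, are merged before decoding. The weighted log-average rule and all names below are illustrative assumptions, not the AUDIMUS implementation itself.

```python
import numpy as np

def combine_mlp_posteriors(posterior_streams, weights=None, eps=1e-12):
    """Merge per-frame phone posteriors from several MLPs (one per feature set).

    posterior_streams: list of arrays, each (n_frames, n_phones), rows summing to 1.
    A weighted log-average (product-of-experts style) combination is a common
    choice for hybrid HMM/MLP systems; the actual AUDIMUS rule may differ.
    """
    streams = [np.asarray(p) for p in posterior_streams]
    if weights is None:
        weights = np.ones(len(streams)) / len(streams)
    log_mix = sum(w * np.log(p + eps) for w, p in zip(weights, streams))
    combined = np.exp(log_mix)
    return combined / combined.sum(axis=1, keepdims=True)  # renormalise per frame

# Toy usage: two MLP streams, 3 frames, 4 phone classes.
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(4), size=3)
b = rng.dirichlet(np.ones(4), size=3)
print(combine_mlp_posteriors([a, b]).shape)  # (3, 4)
```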
{"title":"Speech recognition of broadcast news for the European Portuguese language","authors":"H. Meinedo, N. Souto, J. Neto","doi":"10.1109/ASRU.2001.1034651","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034651","url":null,"abstract":"This paper describes our work on the development of a large vocabulary continuous speech recognition system applied to a broadcast news task for the European Portuguese language in the scope of the ALERT project. We start by presenting the baseline recogniser AUDIMUS, which was originally developed with a corpus of read newspaper text. This is a hybrid system that uses a combination of phone probabilities generated by several MLPs trained on distinct feature sets. The paper details the modifications introduced in this system, namely in the development of a new language model, the vocabulary and pronunciation lexicon and the training on new data from the ALERT BN corpus currently available. The system trained with this BN corpus achieved 18.4% WER when tested with the F0 focus condition (studio, planed, native, clean), and 35.2% when tested in all focus conditions.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VoiceXML 2.0 and the W3C speech interface framework
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034576
J. Larson
The World Wide Web Consortium (W3C) Voice Browser Working Group has released specifications for four integrated languages for developing speech applications: VoiceXML 2.0, the Speech Synthesis Markup Language, the Speech Recognition Grammar Markup Language, and Semantic Interpretation. These languages enable developers to quickly specify conversational speech Web applications that can be accessed from any telephone or cell phone. The speech recognition and natural language communities are welcome to use these specifications and their implementations as they become available, as well as to comment on the direction and details of these evolving specifications.
{"title":"VoiceXML 2.0 and the W3C speech interface framework","authors":"J. Larson","doi":"10.1109/ASRU.2001.1034576","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034576","url":null,"abstract":"The World Wide Web Voice Browser Working Group has released specifications for four integrated languages to developing speech applications: VoiceXML 2.0, Speech Synthesis Markup Language, Speech Recognition Grammar Markup Language, and Semantic Interpretation. These languages enable developers to specify quickly conversational speech Web applications that can be accessed by any telephone or cell phone. The speech recognition and natural language communities are welcome to use these specifications and their implementations as they become available, as well as comment on the direction and details of these evolving specifications.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124806472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natural language call routing: towards combination and boosting of classifiers
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034622
I. Zitouni, H. Kuo, Chin-Hui Lee
We describe different techniques to improve natural language call routing: boosting, relevance feedback, discriminative training, and constrained minimization. Their common goal is to reweight the data in order to let the system focus on documents judged hard to classify by a single classifier. These approaches are evaluated with the common vector-based classifier and also with the beta classifier which had given good results in the similar task of E-mail steering. We explore ways of deriving and combining uncorrelated classifiers in order to improve accuracy. Compared to the cosine and beta baseline classifiers, we report an improvement of 49% and 10%, respectively.
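As a point of reference for the baseline mentioned above, here is a hedged sketch of a vector-based (cosine) call router: each destination is represented by the centroid of its training-query vectors, and a new utterance is routed to the most similar centroid. The TF-IDF weighting, class names and data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class CosineRouter:
    """Vector-based call router: route a query to the destination whose
    centroid of training-query vectors is closest by cosine similarity."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.centroids = None
        self.labels = None

    def fit(self, texts, routes):
        X = self.vectorizer.fit_transform(texts).toarray()
        self.labels = sorted(set(routes))
        self.centroids = np.vstack([
            X[[i for i, r in enumerate(routes) if r == lab]].mean(axis=0)
            for lab in self.labels])
        return self

    def predict(self, texts):
        X = self.vectorizer.transform(texts).toarray()
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        Cn = self.centroids / (np.linalg.norm(self.centroids, axis=1, keepdims=True) + 1e-12)
        return [self.labels[i] for i in (Xn @ Cn.T).argmax(axis=1)]

# Toy usage with two hypothetical destinations.
router = CosineRouter().fit(
    ["I lost my credit card", "what is my account balance"],
    ["card_services", "balance"])
print(router.predict(["check my balance please"]))
```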
{"title":"Natural language call routing: towards combination and boosting of classifiers","authors":"I. Zitouni, H. Kuo, Chin-Hui Lee","doi":"10.1109/ASRU.2001.1034622","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034622","url":null,"abstract":"We describe different techniques to improve natural language call routing: boosting, relevance feedback, discriminative training, and constrained minimization. Their common goal is to reweight the data in order to let the system focus on documents judged hard to classify by a single classifier. These approaches are evaluated with the common vector-based classifier and also with the beta classifier which had given good results in the similar task of E-mail steering. We explore ways of deriving and combining uncorrelated classifiers in order to improve accuracy. Compared to the cosine and beta baseline classifiers, we report an improvement of 49% and 10%, respectively.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123075792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech recognition using advanced HMM2 features
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034590
K. Weber, Samy Bengio, H. Bourlard
HMM2 is a particular hidden Markov model where the state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs (see Weber, K. et al., Proc. ICSLP, vol. III, p.147-50, 2000). As we show in another paper (see Weber et al., Proc. Eurospeech, Sep. 2001), a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach of using a full HMM2 as a feature extractor, working in the spectral domain and extracting robust formant-like features for a standard ASR system. HMM2 performs a nonlinear, state-dependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To improve the HMM2 system further, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where a word error rate of 15% was obtained using only a 4-dimensional feature vector (3 formant-like parameters and one time index).
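The following is a minimal sketch of the HMM2 idea described above: the emission score of a temporal (primary) state is the likelihood of one frame's spectral vector, read bin by bin along the frequency axis, under that state's secondary HMM, so the secondary state boundaries act as a state-dependent frequency segmentation. The Gaussian emissions, the 4-state left-to-right topology and all parameter values are illustrative assumptions.

```python
import numpy as np

def secondary_hmm_loglik(freq_seq, trans, means, variances, init):
    """Forward algorithm of a small frequency-axis HMM: freq_seq holds one
    frame's spectral energies visited bin by bin; the states play the role of
    formant-like frequency regions. Returns log p(freq_seq | secondary HMM),
    which would serve as the primary state's emission score."""
    log_trans = np.log(trans + 1e-300)
    log_b = -0.5 * (np.log(2 * np.pi * variances)
                    + (freq_seq[:, None] - means) ** 2 / variances)  # (n_bins, n_states)
    alpha = np.log(init) + log_b[0]
    for t in range(1, len(freq_seq)):
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

# Toy 4-state left-to-right secondary HMM scoring one 12-bin spectral frame.
trans = np.array([[.7, .3, 0, 0], [0, .7, .3, 0], [0, 0, .7, .3], [0, 0, 0, 1.]])
init = np.array([1., 1e-12, 1e-12, 1e-12])
means = np.array([1., 3., 2., .5])
variances = np.ones(4)
frame = np.abs(np.random.default_rng(1).normal(2., 1., size=12))
print(secondary_hmm_loglik(frame, trans, means, variances, init))
```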
{"title":"Speech recognition using advanced HMM2 features","authors":"K. Weber, Samy Bengio, H. Bourlard","doi":"10.1109/ASRU.2001.1034590","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034590","url":null,"abstract":"HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs (see Weber, K. et al., Proc. ICSGP, vol.III, p.147-50, 2000). As we show in another paper (see Weber et al., Proc. Eurospeech, Sep. 2001), a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formant-like features for a standard ASR system. HMM2 performs a nonlinear, state-dependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To improve the HMM2 system further, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15%, using only a 4-dimensional feature vector (3 formant-like parameters and one time index) were obtained.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123285594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognition of negative emotions from the speech signal
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034632
C. Lee, Shrikanth S. Narayanan, R. Pieraccini
This paper reports on methods for automatic classification of spoken utterances based on the emotional state of the speaker. The data set used for the analysis comes from a corpus of human-machine dialogues recorded from a commercial application deployed by SpeechWorks. Linear discriminant classification with Gaussian class-conditional probability distributions and k-nearest neighbor methods are used to classify utterances into two basic emotion states, negative and non-negative. The features used by the classifiers are utterance-level statistics of the fundamental frequency and energy of the speech signal. To improve classification performance, two feature selection methods are used, namely promising first selection and forward feature selection. Principal component analysis (PCA) is used to reduce the dimensionality of the features while maximizing classification accuracy. The improvements obtained by feature selection and PCA are reported.
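For concreteness, a hedged sketch of the classification pipeline the abstract outlines: utterance-level pitch and energy statistics are projected with PCA and classified either by linear discriminant analysis (Gaussian class-conditionals) or by k-nearest neighbors. The synthetic features, PCA dimensionality and k are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy utterance-level features: [F0 mean, F0 range, energy mean, energy range].
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)          # 0 = non-negative, 1 = negative

lda_clf = make_pipeline(PCA(n_components=3), LinearDiscriminantAnalysis())
knn_clf = make_pipeline(PCA(n_components=3), KNeighborsClassifier(n_neighbors=5))

lda_clf.fit(X, y)
knn_clf.fit(X, y)
print(lda_clf.predict(X[:5]), knn_clf.predict(X[:5]))
```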
{"title":"Recognition of negative emotions from the speech signal","authors":"C. Lee, Shrikanth S. Narayanan, R. Pieraccini","doi":"10.1109/ASRU.2001.1034632","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034632","url":null,"abstract":"This paper reports on methods for automatic classification of spoken utterances based on the emotional state of the speaker. The data set used for the analysis comes from a corpus of human-machine dialogues recorded from a commercial application deployed by SpeechWorks. Linear discriminant classification with Gaussian class-conditional probability distribution and k-nearest neighbors methods are used to classify utterances into two basic emotion states, negative and non-negative The features used by the classifiers are utterance-level statistics of the fundamental frequency and energy of the speech signal. To improve classification performance, two specific feature selection methods are used; namely, promising first selection and forward feature selection. Principal component analysis is used to reduce the dimensionality of the features while maximizing classification accuracy. Improvements obtained by feature selection and PCA are reported. We also report the results.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127487659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic transcription of spontaneous lecture speech
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034618
Tatsuya Kawahara, H. Nanjo, S. Furui
We introduce our extensive projects on spontaneous speech processing and current trials of lecture speech recognition. A large corpus of lecture presentations and talks is being collected in the project. We have trained initial baseline models and confirmed a significant difference between real lectures and written notes. In spontaneous lecture speech, the speaking rate is generally faster and varies widely, which makes it harder to apply fixed segmentation and decoding settings. Therefore, we propose sequential decoding and speaking-rate dependent decoding strategies. The sequential decoder simultaneously performs automatic segmentation and decoding of input utterances. Then, the most appropriate acoustic analysis, phone models and decoding parameters are applied according to the current speaking rate. These strategies improve the automatic transcription of real lecture speech.
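A toy sketch of the speaking-rate dependent idea: a first pass estimates the rate (for instance, phones per second) and the second pass then picks among pre-defined analysis and decoding settings. The thresholds and parameter values are purely illustrative, not those of the paper.

```python
def pick_decoding_config(phones_per_second):
    """Choose acoustic analysis and decoder settings according to the
    estimated speaking rate; values below are illustrative placeholders."""
    if phones_per_second > 12.0:        # fast speech: finer frame shift, wider beam
        return {"frame_shift_ms": 8, "beam": 250, "lm_weight": 9}
    if phones_per_second < 8.0:         # slow speech
        return {"frame_shift_ms": 12, "beam": 180, "lm_weight": 11}
    return {"frame_shift_ms": 10, "beam": 200, "lm_weight": 10}

# A first-pass rate estimate drives the second-pass configuration.
print(pick_decoding_config(13.5))
```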
{"title":"Automatic transcription of spontaneous lecture speech","authors":"Tatsuya Kawahara, H. Nanjo, S. Furui","doi":"10.1109/ASRU.2001.1034618","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034618","url":null,"abstract":"We introduce our extensive projects on spontaneous speech processing and current trials of lecture speech recognition. A large corpus of lecture presentations and talks is being collected in the project. We have trained initial baseline models and confirmed significant difference of real lectures and written notes. In spontaneous lecture speech, the speaking rate is generally faster and changes a lot, which makes it harder to apply fixed segmentation and decoding settings. Therefore, we propose sequential decoding and speaking-rate dependent decoding strategies. The sequential decoder simultaneously performs automatic segmentation and decoding of input utterances. Then, the most adequate acoustic analysis, phone models and decoding parameters are applied according to the current speaking rate. These strategies achieve improvement on automatic transcription of real lecture speech.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125612590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
European Language Resources Association history and recent developments
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034685
K. Choukri
This paper aims at briefly describing the rationale behind the foundation of the European Language Resources Association (ELRA) in 1995 and its activities since then. We would like to focus on the issues involved in making language resources available to different sectors of the language engineering community. ELRA is presented as a conduit for the distribution of speech, written and terminology databases, enabling all players to have access to language resources (LRs). In order to produce and provide such resources effectively to research and development groups in academic, commercial and industrial environments, it is necessary to address legal, logistic and other practical issues. This has already been done by ELRA through the establishment of an operational infrastructure that capitalizes on the investments of the European Commission and European national agencies to ensure the availability of speech, text, and terminology resources.
{"title":"European Language Resources Association history and recent developments","authors":"K. Choukri","doi":"10.1109/ASRU.2001.1034685","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034685","url":null,"abstract":"This paper aims at briefly describing the rationale behind the foundation of the European Language Resources Association (ELRA) in 1995 and its activities since then. We would like to focus on the issues involved in making language resources available to different sectors of the language engineering community. ELRA is presented as a conduit for the distribution of speech, written and terminology databases, enabling all players to have access to language resources (LRs). In order to produce and provide such resources effectively to research and development groups in academic, commercial and industrial environments, it is necessary to address legal, logistic and other practical issues. This has already been done by ELRA through the establishment of an operational infrastructure that capitalizes on the investments of the European Commission and European national agencies to ensure the availability of speech, text, and terminology resources.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127909646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition
Pub Date: 2001-12-01 | DOI: 10.1109/ASRU.2001.1034672
B. Decadt, J. Duchateau, Walter Daelemans, P. Wambacq
We describe a method to enhance the readability of the textual output in a large vocabulary continuous speech recognition system when out-of-vocabulary words occur. The basic idea is to replace uncertain words in the transcriptions with a phoneme recognition result that is post-processed using a phoneme-to-grapheme converter. This converter turns phoneme strings into grapheme strings and is trained using machine learning techniques. Experiments show that, even when the grapheme strings are not fully correct, the resulting transcriptions are more easily readable than the original ones.
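A hedged sketch of the post-processing step described above: words the recogniser is unsure about are replaced by the phoneme-to-grapheme conversion of the corresponding phone recognition result. The confidence threshold and the tiny context-free mapping below stand in for the trained, machine-learned converter.

```python
def phoneme_to_grapheme(phones, p2g_map):
    """Naive context-free stand-in for the trained phoneme-to-grapheme
    converter: map each phone to a grapheme string and concatenate."""
    return "".join(p2g_map.get(p, "") for p in phones)

def post_process(words, confidences, phone_strings, p2g_map, threshold=0.5):
    """Replace low-confidence words in the transcription by the P2G
    conversion of the phone recognition result for that segment."""
    out = []
    for word, conf, phones in zip(words, confidences, phone_strings):
        out.append(word if conf >= threshold else phoneme_to_grapheme(phones, p2g_map))
    return " ".join(out)

# Toy example with an illustrative mapping.
p2g = {"k": "c", "a": "a", "t": "t", "s": "s"}
print(post_process(["the", "<unk>", "sat"],
                   [0.9, 0.2, 0.8],
                   [[], ["k", "a", "t", "s"], []],
                   p2g))   # -> "the cats sat"
```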
{"title":"Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition","authors":"B. Decadt, J. Duchateau, Walter Daelemans, P. Wambacq","doi":"10.1109/ASRU.2001.1034672","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034672","url":null,"abstract":"We describe a method to enhance the readability of the textual output in a large vocabulary continuous speech recognition system when out-of-vocabulary words occur. The basic idea is to replace uncertain words in the transcriptions with a phoneme recognition result that is post-processed using a phoneme-to-grapheme converter. This converter turns phoneme strings into grapheme strings and is trained using machine learning techniques. Experiments show that, even when the grapheme strings are not fully correct, the resulting transcriptions are more easily readable than the original ones.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130627033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maximum-likelihood training of the PLCG-based language model
Pub Date: 2001-12-01 | DOI: 10.1109/ASRU.2001.1034624
D. Van Uytsel, Dirk Van Compernolle, P. Wambacq
In Van Uytsel et al. (2001) a parsing language model based on a probabilistic left-corner grammar (PLCG) was proposed, and encouraging performance on a speech recognition task using the PLCG-based language model was reported. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unannotated training data. The precalculation of forward, inner and outer probabilities of states in the PLCG network provides an elegant crosscut to the computation of transition frequency expectations, which are needed in each iteration of the proposed reestimation procedure. The training algorithm enables model training on very large corpora. In our experiments, test set perplexity is close to saturation after three iterations, 5 to 16% lower than initially. However, we observed no significant improvement of recognition accuracy after reestimation.
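As a rough illustration of the reestimation step, the sketch below shows only the generic maximum-likelihood update: expected transition frequencies accumulated over the unannotated data (in the paper, obtained from the forward, inner and outer probabilities of the PLCG network) are normalised per source state to give new transition probabilities. The PLCG-specific bookkeeping is omitted and all names are assumptions.

```python
from collections import defaultdict

def reestimate_transition_probs(expected_counts):
    """One EM-style reestimation step: expected_counts maps
    (state, transition) -> expected frequency; the new probability is the
    expectation normalised over all transitions leaving the same state."""
    totals = defaultdict(float)
    for (state, _), count in expected_counts.items():
        totals[state] += count
    return {(s, t): c / totals[s] for (s, t), c in expected_counts.items()}

# Toy expected counts for two states.
counts = {("S", "a"): 3.2, ("S", "b"): 0.8, ("T", "a"): 1.0}
print(reestimate_transition_probs(counts))  # {('S','a'): 0.8, ('S','b'): 0.2, ('T','a'): 1.0}
```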
{"title":"Maximum-likelihood training of the PLCG-based language model","authors":"D. Van Uytsel, Dirk Van Compernolle, P. Wambacq","doi":"10.1109/ASRU.2001.1034624","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034624","url":null,"abstract":"In Van Uytsel et al. (2001) a parsing language model based on a probabilistic left-comer grammar (PLCG) was proposed and encouraging performance on a speech recognition task using the PLCG-based language model was reported. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unannotated training data. The precalculation of forward, inner and outer probabilities of states in the PLCG network provides an elegant crosscut to the computation of transition frequency expectations, which are needed in each iteration of the proposed reestimation procedure. The training algorithm enables model training on very large corpora. In our experiments, test set perplexity is close to saturation after three iterations, 5 to 16% lower than initially. We however observed no significant improvement of recognition accuracy after reestimation.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"84 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127978868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed speech recognition with codec parameters
Pub Date: 2001-11-27 | DOI: 10.1109/ASRU.2001.1034604
Bhiksha Raj, Joshua Migdal, Rita Singh
Communication devices which perform distributed speech recognition (DSR) tasks currently transmit standardized coded parameters of speech signals. Recognition features are then extracted, on a remote server, from signals reconstructed from these parameters. Since reconstruction losses degrade recognition performance, proposals are being considered to standardize DSR codecs that derive recognition features directly, to be transmitted and used as-is for recognition. However, such a codec must be embedded on the transmitting device alongside its current standard codec. Performing recognition directly on codec bitstreams avoids these complications: no additional feature-extraction mechanism is required on the device, and there are no reconstruction losses on the server. We propose an LDA-based method for extracting optimal feature sets from codec bitstreams and demonstrate that the features so derived improve recognition performance for the LPC, GSM and CELP codecs. For GSM and CELP, we show that the performance is comparable to that obtained with uncoded speech and with standard DSR-codec features.
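A hedged sketch of the LDA step: per-frame codec parameter vectors, labelled with phone classes (for example, from a forced alignment), are projected onto the directions that best separate those classes, and the projections serve as recognition features. The data, dimensions and labelling scheme are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-in for per-frame codec parameters (e.g., decoded LPC/CELP
# parameters) with phone-class labels providing the LDA supervision.
rng = np.random.default_rng(2)
codec_params = rng.normal(size=(1000, 20))   # 20 codec parameters per frame
phone_labels = rng.integers(0, 10, size=1000)

# LDA keeps at most n_classes - 1 discriminative directions (here 9).
lda = LinearDiscriminantAnalysis(n_components=9).fit(codec_params, phone_labels)
features = lda.transform(codec_params)
print(features.shape)   # (1000, 9): frames x discriminative features
```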
{"title":"Distributed speech recognition with codec parameters","authors":"sha Raj, Joshua Migdal, Rita Singh","doi":"10.1109/ASRU.2001.1034604","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034604","url":null,"abstract":"Communication devices which perform distributed speech recognition (DSR) tasks currently transmit standardized coded parameters of speech signals. Recognition features are extracted from signals reconstructed using these on a remote server. Since reconstruction losses degrade recognition performance, proposals are being considered to standardize DSR-codecs which derive recognition features, to be transmitted and used directly for recognition. However, such a codec must be embedded on the transmitting device, along with its current standard codec. Performing recognition using codec bitstreams avoids these complications: no additional feature-extraction mechanism is required on the device, and there are no reconstruction losses on the server. We propose an LDA-based method for extracting optimal feature sets from codec bitstreams and demonstrate that features so derived result in improved recognition performance for the LPC, GSM and CELP codecs. For GSM and CELP, we show that the performance is comparable to that with uncoded speech and standard DSR-codec features.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114459961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}