Detection of OOV words by combining acoustic confidence measures with linguistic features
F. Stouten, D. Fohr, I. Illina
Pub Date: 2009-12-13 | DOI: 10.1109/ASRU.2009.5372877
This paper describes the design of an Out-Of-Vocabulary word (OOV) detector. Such a system is intended to detect segments that correspond to OOV words (words not included in the lexicon) in the output of an LVCSR system. The OOV detector uses acoustic confidence measures derived from several systems: a word recognizer constrained by a lexicon, a phone recognizer constrained by a grammar, and a phone recognizer without constraints. It also uses several linguistic features. Experimental results on a French broadcast news transcription task show that, for our approach, precision equals recall at 35%.
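One way to realize the acoustic side of such a detector: if the unconstrained phone recognizer explains a segment much better than the lexicon-constrained word recognizer, the segment is a likely OOV candidate. The sketch below computes that kind of frame-normalized score gap and fuses it with linguistic features in a logistic-regression classifier; the synthetic data, the specific features, and the classifier choice are assumptions for illustration, not the authors' exact system.

```python
"""Illustrative OOV-detection combiner: a frame-normalized score gap between
a lexicon-constrained word recognizer and an unconstrained phone recognizer,
fused with linguistic features in a logistic-regression classifier.
The feature set and classifier are assumptions, not the paper's system."""
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def acoustic_confidence(word_loglik, phone_loglik, n_frames):
    # Large positive gap: the unconstrained phone recognizer explains the
    # segment much better than the lexicon does -> likely OOV region.
    return (phone_loglik - word_loglik) / max(n_frames, 1)

# Synthetic stand-in segments: (word_ll, phone_ll, n_frames, LM feature)
def synth_segment(is_oov):
    n = rng.integers(20, 80)
    word_ll = -10.0 * n + (-2.0 * n if is_oov else 0.0) + rng.normal(0, 5)
    phone_ll = -9.5 * n + rng.normal(0, 5)
    lm = rng.normal(-6.0 if is_oov else -3.0, 1.0)   # hypothetical linguistic feature
    return [acoustic_confidence(word_ll, phone_ll, n), lm, n]

y = rng.integers(0, 2, size=400)                     # 1 = OOV, 0 = in-vocabulary
X = np.array([synth_segment(t) for t in y])

clf = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
posterior = clf.predict_proba(X[300:])[:, 1]         # threshold trades precision vs. recall
print("mean OOV posterior on held-out OOV segments:",
      posterior[y[300:] == 1].mean().round(3))
```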
{"title":"Detection of OOV words by combining acoustic confidence measures with linguistic features","authors":"F. Stouten, D. Fohr, I. Illina","doi":"10.1109/ASRU.2009.5372877","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372877","url":null,"abstract":"This paper describes the design of an Out-Of-Vocabulary words (OOV) detector. Such a system is assumed to detect segments that correspond to OOV words (words that are not included in the lexicon) in the output of a LVCSR system. The OOV detector uses acoustic confidence measures that are derived from several systems: a word recognizer constrained by a lexicon, a phone recognizer constrained by a grammar and a phone recognizer without constraints. On top of that it also uses some linguistic features. The experimental results on a French broadcast news transcription task showed that for our approach precision equals recall at 35%.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"114 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120851906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active learning for rule-based and corpus-based Spoken Language Understanding models
Pierre Gotab, Frédéric Béchet, Géraldine Damnati
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373377
Active learning can be used for the maintenance of a deployed Spoken Dialog System (SDS) that evolves over time and for which large collections of dialog traces can be gathered on a daily basis. At the Spoken Language Understanding (SLU) level this maintenance process is crucial, as a deployed SDS evolves quickly when services are added, modified or dropped. Knowledge-based approaches, based on manually written grammars or inference rules, are often preferred because system designers can directly modify the SLU models to account for such a change in the service, even when little or no related data has been collected. However, as new examples are added to the annotated corpus, corpus-based methods can then be applied, either replacing or complementing the initial knowledge-based models. This paper describes an active learning scheme, based on an SLU criterion, which is used to automatically update the SLU models of a deployed SDS. Two kinds of SLU models are compared: rule-based models, used in the deployed system and consisting of several thousand hand-crafted rules, and corpus-based models, obtained by automatically training classifiers on an annotated corpus.
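The maintenance cycle described here follows the usual pool-based active-learning pattern: score the unlabeled dialog traces with the current SLU model, send the least certain ones for annotation, and retrain. A minimal sketch of that loop follows; the least-confident selection criterion, the TF-IDF/logistic-regression classifier, and the toy utterances are stand-ins rather than the paper's SLU criterion or models.

```python
"""Generic pool-based active-learning loop for an SLU intent classifier.
The uncertainty criterion (least-confident sampling) and the classifier are
illustrative stand-ins for the paper's SLU-based selection criterion."""
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic pool of "dialog traces": (utterance, intent label)
pool = [("I want to check my bill", "billing"), ("my internet is down", "support"),
        ("cancel my subscription", "cancel"), ("how much do I owe", "billing"),
        ("the connection keeps dropping", "support"), ("please close my account", "cancel")] * 20

texts, labels = zip(*pool)
vec = TfidfVectorizer().fit(texts)
X, y = vec.transform(texts), np.array(labels)

labeled = list(range(6))                      # seed annotations
unlabeled = [i for i in range(len(y)) if i not in labeled]

for round_ in range(3):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)     # least-confident sampling
    picked = [unlabeled[i] for i in np.argsort(-uncertainty)[:5]]
    labeled += picked                         # in practice: sent to human annotators
    unlabeled = [i for i in unlabeled if i not in picked]
    print(f"round {round_}: {len(labeled)} labeled utterances")
```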
{"title":"Active learning for rule-based and corpus-based Spoken Language Understanding models","authors":"Pierre Gotab, Frédéric Béchet, Géraldine Damnati","doi":"10.1109/ASRU.2009.5373377","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373377","url":null,"abstract":"Active learning can be used for the maintenance of a deployed Spoken Dialog System (SDS) that evolves with time and when large collection of dialog traces can be collected on a daily basis. At the Spoken Language Understanding (SLU) level this maintenance process is crucial as a deployed SDS evolves quickly when services are added, modified or dropped. Knowledge-based approaches, based on manually written grammars or inference rules, are often preferred as system designers can modify directly the SLU models in order to take into account such a modification in the service, even if no or very little related data has been collected. However as new examples are added to the annotated corpus, corpus-based methods can then be applied, replacing or in addition to the initial knowledge-based models. This paper describes an active learning scheme, based on an SLU criterion, which is used for automatically updating the SLU models of a deployed SDS. Two kind of SLU models are going to be compared: rule-based ones, used in the deployed system and consisting of several thousands of hand-crafted rules; corpus-based ones, based on the automatic learning of classifiers on an annotated corpus.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127522086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Noise robust model adaptation using linear spline interpolation
K. Kalgaonkar, M. Seltzer, A. Acero
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373430
This paper presents a novel data-driven technique for performing acoustic model adaptation to noisy environments. In the presence of additive noise, the relationship between log mel spectra of speech, noise and noisy speech is nonlinear. Traditional methods linearize this relationship using the mode of the nonlinearity or use some other approximation. The approach presented in this paper models this nonlinear relationship using linear spline regression. In this method, the set of spline parameters that minimizes the error between the predicted and actual noisy speech features is learned from training data, and used at runtime to adapt clean acoustic model parameters to the current noise conditions. Experiments were performed to evaluate the performance of the system on the Aurora 2 task. Results show that the proposed adaptation algorithm (word accuracy 89.22%) outperforms VTS model adaptation (word accuracy 88.38%).
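For additive noise, the log mel spectra obey approximately y = x + log(1 + exp(n - x)) for clean speech x and noise n, which is the nonlinearity being modeled. The sketch below fits a linear spline (hinge basis, fixed knots) to synthetic noisy-versus-clean values of one channel at a fixed noise level; the knot placement and the plain least-squares fit are assumptions, not the paper's training procedure.

```python
"""Linear spline regression fit to the log-mel additive-noise nonlinearity
y = x + log(1 + exp(n - x)) at a fixed noise level n. Knot placement and
least-squares fitting are illustrative; the paper learns spline parameters
that minimize prediction error on training data."""
import numpy as np

def noisy_logmel(x, n):
    return x + np.log1p(np.exp(n - x))        # exact relation for additive noise

rng = np.random.default_rng(0)
x = rng.uniform(-5, 15, size=2000)            # synthetic clean log-mel energies
y = noisy_logmel(x, n=5.0) + rng.normal(0, 0.05, size=x.shape)

knots = np.linspace(-5, 15, 9)                # assumed knot grid
# Hinge (truncated-linear) basis: y ~ a + b*x + sum_k c_k * max(0, x - t_k)
A = np.column_stack([np.ones_like(x), x] + [np.maximum(0.0, x - t) for t in knots])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

x_test = np.linspace(-5, 15, 5)
A_test = np.column_stack([np.ones_like(x_test), x_test]
                         + [np.maximum(0.0, x_test - t) for t in knots])
print(np.round(A_test @ coef, 3))             # spline prediction of noisy log-mel
print(np.round(noisy_logmel(x_test, 5.0), 3)) # exact nonlinearity for comparison
```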
{"title":"Noise robust model adaptation using linear spline interpolation","authors":"K. Kalgaonkar, M. Seltzer, A. Acero","doi":"10.1109/ASRU.2009.5373430","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373430","url":null,"abstract":"This paper presents a novel data-driven technique for performing acoustic model adaptation to noisy environments. In the presence of additive noise, the relationship between log mel spectra of speech, noise and noisy speech is nonlinear. Traditional methods linearize this relationship using the mode of the nonlinearity or use some other approximation. The approach presented in this paper models this nonlinear relationship using linear spline regression. In this method, the set of spline parameters that minimizes the error between the predicted and actual noisy speech features is learned from training data, and used at runtime to adapt clean acoustic model parameters to the current noise conditions. Experiments were performed to evaluate the performance of the system on the Aurora 2 task. Results show that the proposed adaptation algorithm (word accuracy 89.22%) outperforms VTS model adaptation (word accuracy 88.38%).","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125178029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic detection of vowel pronunciation errors using multiple information sources
Joost van Doremalen, C. Cucchiarini, H. Strik
Pub Date: 2009-12-01 | DOI: 10.1109/asru.2009.5373335
Pronunciation errors made by L2 learners of Dutch frequently involve vowel substitutions. To detect such pronunciation errors, ASR-based confidence measures (CMs) are generally used. In this paper we compare and combine confidence measures with MFCCs and phonetic features. The results show that MFCCs perform best, followed by CMs and then phonetic features, and that substantial improvements can be obtained by combining the different feature types.
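A direct way to "compare and combine" such information sources is to train the same classifier on each feature group alone and on their concatenation. The sketch below does exactly that on synthetic segment-level features; the feature contents, the logistic-regression classifier, and the cross-validation setup are illustrative assumptions.

```python
"""Compare individual feature groups (confidence measure, MFCC statistics,
phonetic features) against their combination for vowel-error detection.
All feature contents and the classifier are illustrative stand-ins."""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
labels = rng.integers(0, 2, size=n)                            # 1 = vowel substitution
cm   = rng.normal(labels[:, None] * -1.0, 1.0, size=(n, 1))    # ASR confidence (lower on errors)
mfcc = rng.normal(labels[:, None] * 0.5, 1.0, size=(n, 12))    # segment-level MFCC means
phon = rng.normal(labels[:, None] * 0.3, 1.0, size=(n, 6))     # phonetic features

groups = {"CM": cm, "MFCC": mfcc, "phonetic": phon,
          "combined": np.hstack([cm, mfcc, phon])}
for name, X in groups.items():
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    print(f"{name:9s} accuracy: {acc:.3f}")
```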
{"title":"Automatic detection of vowel pronunciation errors using multiple information sources","authors":"Joost van Doremalen, C. Cucchiarini, H. Strik","doi":"10.1109/asru.2009.5373335","DOIUrl":"https://doi.org/10.1109/asru.2009.5373335","url":null,"abstract":"Frequent pronunciation errors made by L2 learners of Dutch often concern vowel substitutions. To detect such pronunciation errors, ASR-based confidence measures (CMs) are generally used. In the current paper we compare and combine confidence measures with MFCCs and phonetic features. The results show that the best results are obtained by using MFCCs, then CMs, and finally phonetic features, and that substantial improvements can be obtained by combining different features.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"47 23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123587260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Asian network-based speech-to-speech translation system
S. Sakti, Noriyuki Kimura, Michael Paul, Chiori Hori, E. Sumita, Satoshi Nakamura, Jun Park, C. Wutiwiwatchai, Bo Xu, Hammam Riza, K. Arora, C. Luong, Haizhou Li
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373353
This paper outlines the first Asian network-based speech-to-speech translation system, developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. The system was designed to translate common spoken utterances from travel conversations from a given source language into multiple target languages, in order to facilitate multiparty travel conversations among people speaking different Asian languages. Each A-STAR member contributes one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. Currently, the system covers nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) plus English. The system's domain covers about 20,000 travel expressions, including proper nouns such as the names of famous places and attractions in Asian countries. In this paper, we discuss the difficulties involved in connecting various spoken language translation systems through Web servers. We also present speech-translation results from the first A-STAR demo experiments, carried out in July 2009.
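Chaining the three Web services for a single utterance amounts to ASR in the source language, machine translation of the hypothesis, and TTS in the target language. The sketch below shows such a client-side chain; every endpoint URL, payload field, and the use of HTTP/JSON are hypothetical placeholders, since the A-STAR servers' actual interfaces are not reproduced here.

```python
"""Client-side chaining of ASR -> MT -> TTS Web services for one utterance.
All endpoint URLs and payload fields are hypothetical placeholders; the
A-STAR servers' real interfaces are not reproduced here."""
import requests

ASR_URL = "http://asr.example.org/recognize"     # hypothetical endpoint
MT_URL  = "http://mt.example.org/translate"      # hypothetical endpoint
TTS_URL = "http://tts.example.org/synthesize"    # hypothetical endpoint

def speech_to_speech(wav_bytes: bytes, src_lang: str, tgt_lang: str) -> bytes:
    # 1) Automatic speech recognition in the source language.
    text = requests.post(ASR_URL, files={"audio": wav_bytes},
                         data={"lang": src_lang}).json()["text"]
    # 2) Machine translation into the target language.
    translation = requests.post(MT_URL, json={"text": text, "source": src_lang,
                                              "target": tgt_lang}).json()["text"]
    # 3) Text-to-speech synthesis in the target language.
    return requests.post(TTS_URL, json={"text": translation,
                                        "lang": tgt_lang}).content

# Example call (requires the hypothetical servers to exist):
# audio_out = speech_to_speech(open("hello_ja.wav", "rb").read(), "ja", "th")
```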
{"title":"The Asian network-based speech-to-speech translation system","authors":"S. Sakti, Noriyuki Kimura, Michael Paul, Chiori Hori, E. Sumita, Satoshi Nakamura, Jun Park, C. Wutiwiwatchai, Bo Xu, Hammam Riza, K. Arora, C. Luong, Haizhou Li","doi":"10.1109/ASRU.2009.5373353","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373353","url":null,"abstract":"This paper outlines the first Asian network-based speech-to-speech translation system developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. The system was designed to translate common spoken utterances of travel conversations from a certain source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. Each A-STAR member contributes one or more of the following spoken language technologies: automatic speech recognition, machine translation, and text-to-speech through Web servers. Currently, the system has successfully covered 9 languages— namely, 8 Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, Chinese) and additionally, the English language. The system's domain covers about 20,000 travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we discuss the difficulties involved in connecting various different spoken language translation systems through Web servers. We also present speech-translation results on the first A-STAR demo experiments carried out in July 2009.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125316443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic emotion recognition: A benchmark comparison of performances
Björn Schuller, Bogdan Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372886
In the light of the first challenge on emotion recognition from speech, we provide the largest benchmark comparison to date under equal conditions on nine standard corpora in the field, using the two predominant paradigms: frame-level modeling by means of hidden Markov models and supra-segmental modeling by systematic feature brute-forcing. The investigated corpora are the ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SmartKom, SUSAS, and VAM databases. To provide better comparability among sets, we additionally cluster each database's emotions into binary valence and arousal discrimination tasks. The results show large differences among corpora, stemming mostly from the contrast between naturalistic emotions with spontaneous speech and more prototypical events. Further, supra-segmental modeling proves significantly beneficial on average when several classes are addressed at a time.
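Supra-segmental modeling in this sense maps a variable-length sequence of frame-level low-level descriptors onto a fixed-length vector of statistical functionals that a static classifier can consume. The sketch below computes a small illustrative subset of such functionals; the descriptor count and the particular statistics are assumptions, far short of the systematic brute-forcing used in the paper.

```python
"""Map variable-length frame-level descriptors (e.g. pitch, energy, MFCCs)
onto a fixed-length supra-segmental vector of statistical functionals.
The functionals chosen here are a small illustrative subset of what
systematic feature brute-forcing would generate."""
import numpy as np

def functionals(lld: np.ndarray) -> np.ndarray:
    """lld: (n_frames, n_descriptors) low-level descriptors for one utterance."""
    stats = [lld.mean(0), lld.std(0), lld.min(0), lld.max(0),
             np.percentile(lld, 25, axis=0), np.percentile(lld, 75, axis=0)]
    deltas = np.diff(lld, axis=0)
    stats += [deltas.mean(0), deltas.std(0)]
    return np.concatenate(stats)               # fixed length regardless of n_frames

rng = np.random.default_rng(0)
utt = rng.normal(size=(rng.integers(80, 300), 16))   # fake utterance, 16 descriptors
print(functionals(utt).shape)                        # (16 descriptors * 8 functionals,)
```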
{"title":"Acoustic emotion recognition: A benchmark comparison of performances","authors":"Björn Schuller, Bogdan Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth","doi":"10.1109/ASRU.2009.5372886","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372886","url":null,"abstract":"In the light of the first challenge on emotion recognition from speech we provide the largest-to-date benchmark comparison under equal conditions on nine standard corpora in the field using the two pre-dominant paradigms: modeling on a frame-level by means of hidden Markov models and supra-segmental modeling by systematic feature brute-forcing. Investigated corpora are the ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SmartKom, SUSAS, and VAM databases. To provide better comparability among sets, we additionally cluster each database's emotions into binary valence and arousal discrimination tasks. In the result large differences are found among corpora that mostly stem from naturalistic emotions and spontaneous speech vs. more prototypical events. Further, supra-segmental modeling proves significantly beneficial on average when several classes are addressed at a time.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122951614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker de-identification via voice transformation
Qin Jin, Arthur R. Toth, Tanja Schultz, A. Black
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373356
It is a common feature of modern automated voice-driven applications and services to record and transmit a user's spoken request. At the same time, several domains and applications may require that the content of the user's request remain confidential and that the speaker's identity be protected. This calls for a technology that allows the speaker's voice to be de-identified, in the sense that the voice sounds natural and intelligible but does not reveal the identity of the speaker. In this paper we investigate different voice transformation strategies on a large population of speakers to disguise the speakers' identities while preserving the intelligibility of their voices. We apply two automatic speaker identification approaches, a GMM-based approach and a phonetic approach, to verify the success of de-identification by voice transformation. The evaluation based on these automatic speaker identification systems verifies that the proposed voice transformation technique enables transmission of the content of the users' spoken requests while successfully concealing their identities. The results also indicate that different speakers still sound distinct after the transformation. Furthermore, we carried out a human listening test which showed the transformed speech to be both intelligible and securely de-identified, as it hid the identity of the speakers even from listeners who knew the speakers very well.
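The GMM-based check can be read as: fit one Gaussian mixture per speaker on enrollment frames, identify a test utterance with the model of highest average log-likelihood, and measure how far identification accuracy drops after the voice transformation. The sketch below illustrates that protocol on synthetic features, with a simple mean shift standing in for the transformation; it is not the paper's speaker identification system.

```python
"""GMM-based speaker identification as a de-identification check: fit one
GMM per speaker on enrollment frames, identify a test utterance by the model
with the highest average log-likelihood. Synthetic features stand in for
real MFCC frames, and a mean shift mimics a voice transformation."""
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim, n_speakers = 13, 4
speaker_means = rng.normal(0, 2.0, size=(n_speakers, dim))

def frames(spk, n=500, shift=0.0):
    # `shift` crudely mimics the effect of a voice transformation.
    return rng.normal(speaker_means[spk] + shift, 1.0, size=(n, dim))

models = [GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(frames(s)) for s in range(n_speakers)]

def identify(x):
    return int(np.argmax([m.score(x) for m in models]))   # score = mean log-likelihood

original = [identify(frames(s, n=200)) for s in range(n_speakers)]
disguised = [identify(frames(s, n=200, shift=3.0)) for s in range(n_speakers)]
print("identified (original): ", original)     # ideally [0, 1, 2, 3]
print("identified (disguised):", disguised)    # errors here suggest de-identification
```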
{"title":"Speaker de-identification via voice transformation","authors":"Qin Jin, Arthur R. Toth, Tanja Schultz, A. Black","doi":"10.1109/ASRU.2009.5373356","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373356","url":null,"abstract":"It is a common feature of modern automated voice-driven applications and services to record and transmit a user's spoken request. At the same time, several domains and applications may require keeping the content of the user's request confidential and at the same time preserving the speaker's identity. This requires a technology that allows the speaker's voice to be de-identified in the sense that the voice sounds natural and intelligible but does not reveal the identity of the speaker. In this paper we investigate different voice transformation strategies on a large population of speakers to disguise the speakers' identities while preserving the intelligibility of the voices. We apply two automatic speaker identification approaches to verify the success of de-identification with voice transformation, a GMM-based and a Phonetic approach. The evaluation based on the automatic speaker identification systems verifies that the proposed voice transformation technique enables transmission of the content of the users' spoken requests while successfully preserving their identities. Also, the results indicate that different speakers still sound distinct after the transformation. Furthermore, we carried out a human listening test that proved the transformed speech to be both intelligible and securely de-identified, as it hid the identity of the speakers even to listeners who knew the speakers very well.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117128422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-band modulation spectrum compensation for robust speech recognition
Wen-hsiang Tu, Sheng-Yuan Huang, J. Hung
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373506
This paper proposes a novel scheme for applying feature statistics normalization techniques for robust speech recognition. In the proposed approach, the temporal-domain feature sequence to be processed is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into non-uniform sub-band segments, and each sub-band segment is then individually processed by well-known normalization methods such as mean normalization (MN), mean and variance normalization (MVN), and histogram equalization (HEQ). Finally, we reconstruct the feature stream from the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components corresponding to the more important modulation spectral bands in the feature sequence can be processed separately. For the Aurora-2 clean-condition training task, the proposed sub-band spectral MVN and HEQ provide relative error rate reductions of 18.66% and 23.58% over conventional temporal MVN and HEQ, respectively.
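Per feature dimension, the processing chain is: DFT of the temporal trajectory, decomposition of the magnitude spectrum into sub-bands, normalization within each sub-band, and inverse DFT using the original phase. The sketch below implements the MVN variant of that chain; the band edges and the reference statistics it normalizes toward are assumptions.

```python
"""Sub-band modulation-spectrum MVN for one feature trajectory: DFT the
temporal sequence, normalize the magnitude spectrum band by band toward
reference statistics, and resynthesize with the original phase. Band edges
and reference statistics are illustrative assumptions."""
import numpy as np

def subband_mvn(traj, band_edges, ref_mean, ref_std):
    """traj: (T,) trajectory of one cepstral coefficient over an utterance.
    band_edges: normalized modulation frequencies in [0, 0.5] delimiting sub-bands.
    ref_mean/ref_std: per-band reference magnitude statistics (e.g. estimated
    from clean training data)."""
    spec = np.fft.rfft(traj)
    mag, phase = np.abs(spec), np.angle(spec)
    freqs = np.fft.rfftfreq(len(traj))                 # 0 .. 0.5 cycles/frame
    for b in range(len(band_edges) - 1):
        idx = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
        if idx.any():
            mu, sd = mag[idx].mean(), mag[idx].std() + 1e-8
            mag[idx] = (mag[idx] - mu) / sd * ref_std[b] + ref_mean[b]
    return np.fft.irfft(np.maximum(mag, 0.0) * np.exp(1j * phase), n=len(traj))

rng = np.random.default_rng(0)
c1 = rng.normal(size=300)                              # fake MFCC-c1 trajectory
edges = [0.0, 0.02, 0.06, 0.15, 0.51]                  # non-uniform sub-bands (assumed)
ref_m = np.array([8.0, 5.0, 3.0, 1.5])                 # assumed reference statistics
ref_s = np.array([2.0, 1.5, 1.0, 0.5])
print(subband_mvn(c1, edges, ref_m, ref_s)[:5].round(3))
```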
{"title":"Sub-band modulation spectrum compensation for robust speech recognition","authors":"Wen-hsiang Tu, Sheng-Yuan Huang, J. Hung","doi":"10.1109/ASRU.2009.5373506","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373506","url":null,"abstract":"This paper proposes a novel scheme in performing feature statistics normalization techniques for robust speech recognition. In the proposed approach, the processed temporal-domain feature sequence is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into non-uniform sub-band segments, and then each sub-band segment is individually processed by the well-known normalization methods, like mean normalization (MN), mean and variance normalization (MVN) and histogram equalization (HEQ). Finally, we reconstruct the feature stream with all the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components that correspond to more important modulation spectral bands in the feature sequence can be processed separately. For the Aurora-2 clean-condition training task, the new proposed sub-band spectral MVN and HEQ provide relative error rate reductions of 18.66% and 23.58% over the conventional temporal MVN and HEQ, respectively.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128680211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diagonal priors for full covariance speech recognition
P. Bell, Simon King
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373344
We investigate the use of full covariance Gaussians for large-vocabulary speech recognition. The large number of parameters gives high modelling power, but when training data is limited, the standard sample covariance matrix is often poorly conditioned, and has high variance. We explain how these problems may be solved by the use of a diagonal covariance smoothing prior, and relate this to the shrinkage estimator, for which the optimal shrinkage parameter may itself be estimated from the training data. We also compare the use of generatively and discriminatively trained priors. Results are presented on a large vocabulary conversational telephone speech recognition task.
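The smoothed estimate combines the sample covariance S with its own diagonal, Sigma_hat = lambda * diag(S) + (1 - lambda) * S. The paper estimates the optimal shrinkage weight from the training data itself; in the sketch below a held-out log-likelihood grid search stands in for that estimator.

```python
"""Shrinkage of a sample covariance toward its diagonal:
Sigma_hat = lam * diag(S) + (1 - lam) * S.
The paper estimates the optimal shrinkage weight from training data; here a
held-out log-likelihood grid search stands in for that estimator."""
import numpy as np

def gauss_loglik(X, mean, cov):
    d = X.shape[1]
    diff = X - mean
    sign, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + maha))

rng = np.random.default_rng(0)
d, n_train = 39, 60                       # many dimensions, few samples: S is ill-conditioned
A = rng.normal(size=(d, d))
true_cov = A @ A.T / d + np.eye(d)
X = rng.multivariate_normal(np.zeros(d), true_cov, size=n_train + 500)
X_train, X_held = X[:n_train], X[n_train:]

mean = X_train.mean(0)
S = np.cov(X_train, rowvar=False)
best = max(np.linspace(0.05, 0.95, 19),
           key=lambda lam: gauss_loglik(X_held, mean,
                                        lam * np.diag(np.diag(S)) + (1 - lam) * S))
print("selected shrinkage weight:", round(float(best), 2))
```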
{"title":"Diagonal priors for full covariance speech recognition","authors":"P. Bell, Simon King","doi":"10.1109/ASRU.2009.5373344","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373344","url":null,"abstract":"We investigate the use of full covariance Gaussians for large-vocabulary speech recognition. The large number of parameters gives high modelling power, but when training data is limited, the standard sample covariance matrix is often poorly conditioned, and has high variance. We explain how these problems may be solved by the use of a diagonal covariance smoothing prior, and relate this to the shrinkage estimator, for which the optimal shrinkage parameter may itself be estimated from the training data. We also compare the use of generatively and discriminatively trained priors. Results are presented on a large vocabulary conversational telephone speech recognition task.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131269495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal quantization and bit allocation for compressing large discriminative feature space transforms
E. Marcheret, V. Goel, P. Olsen
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373407
Discriminative training of the feature space using the minimum phone error (MPE) objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost in the memory required to store the transform. In a previous paper we reduced this memory requirement by 94% by quantizing the transform parameters. We used dimension-dependent quantization tables and learned the quantization values with a fixed assignment of transform parameters to quantization values. In this paper we refine and extend these techniques to attain a further 35% reduction in memory with no degradation in sentence error rate. We describe a principled method for assigning the transform parameters to quantization values. We also show how the memory can be gradually reduced using a Viterbi algorithm that optimally assigns a variable number of bits to the dimension-dependent quantization tables. The techniques described could also be applied to the quantization of general linear transforms, a problem that should be of wider interest.
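The allocation problem can be stated as: given, for each dimension, the quantization distortion achievable at each candidate bit width, choose per-dimension bit widths that minimize total distortion under a global bit budget. The sketch below builds such distortion tables with simple quantile codebooks and solves the allocation by dynamic programming; it illustrates the flavour of the optimization, not the paper's exact Viterbi formulation.

```python
"""Bit allocation across dimension-dependent scalar quantization tables:
measure, per dimension, the distortion at each candidate bit width (using a
simple quantile codebook), then pick per-dimension bit widths minimizing
total distortion under a global bit budget via dynamic programming.
Illustrative only, not the paper's exact Viterbi formulation."""
import numpy as np

rng = np.random.default_rng(0)
params = [rng.normal(0, s, size=2000) for s in (0.2, 1.0, 3.0, 0.5)]  # 4 "dimensions"
bit_choices = [1, 2, 3, 4]
budget = 10                                       # total bits across all dimensions

def distortion(x, bits):
    levels = 2 ** bits
    codebook = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # equal-mass bins
    quantized = codebook[np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)]
    return float(np.mean((x - quantized) ** 2))

D = [[distortion(x, b) for b in bit_choices] for x in params]  # per-dimension tables

# DP over dimensions: state = bits spent so far, value = (total distortion, allocation)
best = {0: (0.0, [])}
for d in range(len(params)):
    nxt = {}
    for spent, (cost, alloc) in best.items():
        for j, b in enumerate(bit_choices):
            c = spent + b
            if c <= budget and (c not in nxt or cost + D[d][j] < nxt[c][0]):
                nxt[c] = (cost + D[d][j], alloc + [b])
    best = nxt
cost, alloc = min(best.values())
print("bits per dimension:", alloc, "total distortion:", round(cost, 4))
```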
{"title":"Optimal quantization and bit allocation for compressing large discriminative feature space transforms","authors":"E. Marcheret, V. Goel, P. Olsen","doi":"10.1109/ASRU.2009.5373407","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373407","url":null,"abstract":"Discriminative training of the feature space using the minimum phone error (MPE) objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost of memory required to store the transform. In a previous paper we reduced this memory requirement by 94% by quantizing the transform parameters. We used dimension dependent quantization tables and learned the quantization values with a fixed assignment of transform parameters to quantization values. In this paper we refine and extend the techniques to attain a further 35% reduction in memory with no degradation in sentence error rate. We discuss a principled method to assign the transform parameters to quantization values. We also show how the memory can be gradually reduced using a Viterbi algorithm to optimally assign variable number of bits to dimension dependent quantization tables. The techniques described could also be applied to the quantization of general linear transforms - a problem that should be of wider interest.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114446903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}