Fast audio search using vector space modelling
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430187
Brett Matthews, U. Chaudhari, B. Ramabhadran
Many techniques for retrieving arbitrary content from audio have been developed to address the challenge of providing fast access to very large volumes of multimedia data. We present a two-stage method for fast audio search, in which a vector-space modelling approach is first used to retrieve a short list of candidate audio segments for a query. The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words. We explore various system configurations and examine trade-offs between speed and accuracy. We evaluate our audio search system following the NIST 2006 Spoken Term Detection evaluation. We find that we can obtain a 30-times speedup in the search phase of our system with a 10% relative loss in accuracy.
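As background for the first stage, here is a minimal sketch of vector-space shortlisting, assuming segments and queries are represented as TF-IDF vectors over tokens (for example, phone n-grams from ASR output) and ranked by cosine similarity; the function names, token representation, and top-k cutoff are illustrative assumptions, not the paper's exact configuration.

import numpy as np
from collections import Counter

def tfidf_matrix(token_seqs, vocab, idf):
    # Map each token sequence (one per audio segment or query) to a TF-IDF row vector.
    index = {tok: i for i, tok in enumerate(vocab)}
    mat = np.zeros((len(token_seqs), len(vocab)))
    for row, seq in enumerate(token_seqs):
        for tok, count in Counter(seq).items():
            if tok in index:
                mat[row, index[tok]] = count * idf[index[tok]]
    return mat

def shortlist(query_vec, segment_mat, k=50):
    # Stage 1: rank all indexed segments by cosine similarity and keep the top k
    # as candidates for the detailed word/phone index search of stage 2.
    denom = np.linalg.norm(segment_mat, axis=1) * np.linalg.norm(query_vec) + 1e-9
    scores = segment_mat @ query_vec / denom
    return np.argsort(-scores)[:k]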
{"title":"Fast audio search using vector space modelling","authors":"Brett Matthews, U. Chaudhari, B. Ramabhadran","doi":"10.1109/ASRU.2007.4430187","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430187","url":null,"abstract":"Many techniques for retrieving arbitrary content from audio have been developed to leverage the important challenge of providing fast access to very large volumes of multimedia data. We present a two-stage method for fast audio search, where a vector-space modelling approach is first used to retrieve a short list of candidate audio segments for a query. The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words. We explore various system configurations and examine trade-offs between speed and accuracy. We evaluate our audio search system according to the NIST 2006 Spoken Term Detection evaluation initiative. We find that we can obtain a 30-times speedup for the search phase of our system with a 10% relative loss in accuracy.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125211754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Submodularity and adaptation
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430118
J. Bilmes
Summary form only given. Convexity is a property of real-valued functions that enables their efficient optimization. Convex optimization, moreover, is a problem onto which an amazing variety of practical problems can be cast. With strong analogs to convexity, submodularity is a property of functions on discrete sets that allows them to be optimized in polynomial time. Submodularity generalizes the common notion of diminishing returns. Like convexity, a large variety of discrete optimization problems can be cast in terms of submodular optimization. The first part of this talk will survey recent work in our lab on the application of submodularity to machine learning, including discriminative structure learning and word clustering for language models. The second part of the talk will discuss recent work on a technique that for many years has been widely successful in speech recognition, namely adaptation. We will view adaptation in a setting where the training-time and testing-time distributions are not assumed to be identical (unlike typical Bayes risk theory). We will derive generalization error and sample complexity bounds for adaptation, specified in terms of a natural divergence between the training and test distributions. These bounds, moreover, lead to practical and effective adaptation strategies for both generative models (e.g., GMMs, HMMs) and discriminative models (e.g., MLPs, SVMs). Joint work with Mukund Narasimhan and Xiao Li.
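As a concrete illustration of the diminishing-returns property mentioned above (not taken from the talk itself), a set-cover function is submodular: adding a set to a larger collection can never help more than adding it to a subset of that collection.

def coverage(chosen, sets_by_id):
    # f(S) = number of universe elements covered by the chosen sets -- a classic submodular function.
    covered = set()
    for i in chosen:
        covered |= sets_by_id[i]
    return len(covered)

sets_by_id = {0: {1, 2, 3}, 1: {3, 4}, 2: {2, 4, 5}}
A, B, x = {0}, {0, 1}, 2          # A is a subset of B; x is a new set to add
gain_A = coverage(A | {x}, sets_by_id) - coverage(A, sets_by_id)   # 2
gain_B = coverage(B | {x}, sets_by_id) - coverage(B, sets_by_id)   # 1
assert gain_A >= gain_B           # diminishing returns holds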
{"title":"Submodularity and adaptation","authors":"J. Bilmes","doi":"10.1109/ASRU.2007.4430118","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430118","url":null,"abstract":"Summary form only given. Convexity is a property of real-valued functions that enable their efficient optimization. Convex optimization moreover is a problem onto which an amazing variety of practical problems can be cast. Having strong analogs to convexity, submodularity is a property of functions on discrete sets that allows their optimization to be done in only polynomial time. Submodularity generalizes the common notion of diminishing returns. Like convexity, a large variety of discrete optimization problems can be cast in terms of submodular optimization. The first part of this talk will survey recent work taking place in our lab on the application of submodularity to machine learning, which includes discriminative structure learning and word clustering for language models. The second part of the talk will discuss recent work on a technique that for many years has been widely successful in speech recognition, namely adaptation. We will view adaptation in a setting where the training and testing time distributions are not assumed identical (unlike typical Bayes risk theory). We will derive generalization error and sample complexity bounds for adaptation which are specified in terms of a natural divergence between the train/test distributions. These bounds, moreover, lead to practical and effective adaptation strategies for both generative models (e.g., GMMs, HMMs) and discriminative models (e.g., MLPs, SVMs). Joint work with Mukund Narasimhan and Xiao Li.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125728207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-native pronunciation variation modeling using an indirect data driven method
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430114
Mina Kim, Y. Oh, H. Kim
In this paper, we propose a pronunciation variation modeling method that improves the performance of a non-native automatic speech recognition (ASR) system without degrading the performance of a native ASR system. The proposed method is based on an indirect data-driven approach: pronunciation variability is investigated from the training speech data, and variant rules are subsequently derived and applied to compensate for that variability in the ASR pronunciation dictionary. To this end, native utterances are first recognized using a phoneme recognizer, and the variant phoneme patterns of native speech are then obtained by aligning the recognized and reference phonetic sequences. The reference sequences are transcribed using canonical, knowledge-based, and hand-labeled methods. As with native speech, the variant phoneme patterns of non-native speech are obtained by recognizing non-native utterances and comparing the recognized phoneme sequences with the reference phonetic transcriptions. Finally, variant rules are derived from the native and non-native variant phoneme patterns using decision trees and applied to adapt the dictionaries of the non-native and native ASR systems. In this paper, Korean spoken by native Chinese speakers is considered as the non-native speech. Non-native ASR experiments show that an ASR system using the dictionary constructed by the proposed pronunciation variation modeling method reduces the average word error rate (WER) by 18.5% relative compared to the baseline ASR system using a canonically transcribed dictionary. In addition, the WER of a native ASR system using the proposed dictionary is reduced by 1.1% relative compared to the baseline native ASR system with a canonically constructed dictionary.
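A minimal sketch of the pattern-extraction step, assuming a simple edit-distance alignment between reference and recognized phone sequences; the phone symbols and the helper name are hypothetical, and the paper's alignment and decision-tree rule derivation are more elaborate than this.

import difflib
from collections import Counter

def variant_patterns(reference, recognized):
    # Align the reference phone sequence with the recognizer output and count the
    # substitution/deletion/insertion patterns, i.e. candidate variant rules.
    patterns = Counter()
    matcher = difflib.SequenceMatcher(a=reference, b=recognized)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            patterns[(tuple(reference[i1:i2]), tuple(recognized[j1:j2]))] += 1
    return patterns

ref = ["k", "a", "ng", "m", "u", "l"]        # hypothetical reference transcription
hyp = ["k", "a", "n", "m", "u"]              # hypothetical recognizer output
print(variant_patterns(ref, hyp))            # e.g. ('ng',)->('n',) substitution, ('l',)->() deletion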
{"title":"Non-native pronunciation variation modeling using an indirect data driven method","authors":"Mina Kim, Y. Oh, H. Kim","doi":"10.1109/ASRU.2007.4430114","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430114","url":null,"abstract":"In this paper, we propose a pronunciation variation modeling method for improving the performance of a non-native automatic speech recognition (ASR) system that does not degrade the performance of a native ASR system. The proposed method is based on an indirect data-driven approach, where pronunciation variability is investigated from the training speech data, and variant rules are subsequently derived and applied to compensate for variability in the ASR pronunciation dictionary. To this end, native utterances are first recognized by using a phoneme recognizer, and then the variant phoneme patterns of native speech are obtained by aligning the recognized and reference phonetic sequences. The reference sequences are transcribed by using each of canonical, knowledge-based, and hand-labeled methods. Similar to non-native speech, the variant phoneme patterns of non-native speech can also be obtained by recognizing non-native utterances and comparing the recognized phoneme sequences and reference phonetic transcriptions. Finally, variant rules are derived from native and non-native variant phoneme patterns using decision trees and applied to the adaptation of a dictionary for non-native and native ASR systems. In this paper, Korean spoken by Chinese native speakers is considered as the non-native speech. It is shown from non-native ASR experiments that an ASR system using the dictionary constructed by the proposed pronunciation variation modeling method can relatively reduce the average word error rate (WER) by 18.5% when compared to the baseline ASR system using a canonical transcribed dictionary. In addition, the WER of a native ASR system using the proposed dictionary is also relatively reduced by 1.1%, as compared to the baseline native ASR system with a canonical constructed dictionary.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125864794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards bottom-up continuous phone recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430174
S. Siniscalchi, T. Svendsen, Chin-Hui Lee
We present a novel approach to designing bottom-up automatic speech recognition (ASR) systems. The key component of the proposed approach is a bank of articulatory attribute detectors implemented with a set of feed-forward artificial neural networks (ANNs). Each detector computes a score describing the degree to which the current frame exhibits a specified speech attribute. These cues are first combined by an event merger that provides evidence about the presence of a higher-level feature, which is then verified by an evidence verifier to produce hypotheses at the phone or word level. We evaluate several configurations of the proposed system on a continuous phone recognition task. Experimental results on the TIMIT database show that the system achieves a phone error rate of 25%, which is superior to results obtained with either hidden Markov model (HMM) or conditional random field (CRF) based recognizers. We believe the system's inherent flexibility and the ease of adding new detectors may provide further improvements.
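A minimal sketch of the detector bank and event merger, with randomly initialized weights standing in for the trained feed-forward ANNs; the attribute inventory, the 39-dimensional frame, and the linear merger are assumptions for illustration only, not the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)

class AttributeDetector:
    # One small feed-forward net scoring a single articulatory attribute per frame.
    def __init__(self, dim_in, dim_hidden):
        self.W1 = rng.standard_normal((dim_hidden, dim_in)) * 0.1
        self.w2 = rng.standard_normal(dim_hidden) * 0.1
    def score(self, frame):
        return 1.0 / (1.0 + np.exp(-(self.w2 @ np.tanh(self.W1 @ frame))))

attributes = ["vowel", "fricative", "nasal", "voiced"]
bank = {a: AttributeDetector(39, 64) for a in attributes}    # 39-dim acoustic frame assumed

frame = rng.standard_normal(39)
attr_scores = np.array([bank[a].score(frame) for a in attributes])

# Event merger: a simple linear map from attribute scores to higher-level (phone) evidence,
# which would then be passed on to the evidence verifier.
merge_W = rng.standard_normal((3, len(attributes)))          # 3 hypothetical phone classes
phone_evidence = merge_W @ attr_scores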
{"title":"Towards bottom-up continuous phone recognition","authors":"S. Siniscalchi, T. Svendsen, Chin-Hui Lee","doi":"10.1109/ASRU.2007.4430174","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430174","url":null,"abstract":"We present a novel approach to designing bottom-up automatic speech recognition (ASR) systems. The key component of the proposed approach is a bank of articulatory attribute detectors implemented using a set of feed-forward artificial neural networks (ANNs). Each detector computes a score describing an activation level of the specified speech attributes that the current frame exhibits. These cues are first combined by an event merger that provides some evidence about the presence of a higher level feature which is then verified by an evidence verifier to produce hypotheses at the phone or word level. We evaluate several configurations of our proposed system on a continuous phone recognition task. Experimental results on the TIMIT database show that the system achieves a phone error rate of 25% which is superior to results obtained with either hidden Markov model (HMM) or conditional random field (CRF) based recognizers. We believe the system's inherent flexibility and the ease of adding new detectors may provide further improvements.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129750150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HMM training based on CV-EM and CV Gaussian mixture optimization
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430131
T. Shinozaki, Tatsuya Kawahara
A combination of the cross-validation EM (CV-EM) algorithm and the cross-validation (CV) Gaussian mixture optimization method is explored. CV-EM and CV Gaussian mixture optimization are our previously proposed training algorithms that use the CV likelihood instead of the conventional training-set likelihood for robust model estimation. Since CV-EM is a parameter optimization method and CV Gaussian mixture optimization is a structure optimization algorithm, the two methods can be combined. Large vocabulary speech recognition experiments are performed on oral presentations. It is shown that both CV-EM and CV Gaussian mixture optimization give lower word error rates than conventional EM, and that their combination is effective in further reducing the word error rate.
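To make the idea of a cross-validation likelihood concrete, the sketch below scores each data fold with a single Gaussian estimated from the remaining folds, so the criterion is not inflated by fitting the evaluation data itself; this is a toy illustration under those assumptions, not the authors' CV-EM implementation.

import numpy as np

def cv_log_likelihood(data, n_folds=5, seed=0):
    # Cross-validation likelihood of a 1-D Gaussian: each fold is evaluated with
    # the mean and variance estimated from the other folds only.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(data), n_folds)
    total = 0.0
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        mu, var = train.mean(), train.var() + 1e-6
        total += np.sum(-0.5 * (np.log(2 * np.pi * var) + (held_out - mu) ** 2 / var))
    return total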
{"title":"HMM training based on CV-EM and CV Gaussian mixture optimization","authors":"T. Shinozaki, Tatsuya Kawahara","doi":"10.1109/ASRU.2007.4430131","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430131","url":null,"abstract":"A combination of the cross-validation EM (CV-EM) algorithm and the cross-validation (CV) Gaussian mixture optimization method is explored. CV-EM and CV Gaussian mixture optimization are our previously proposed training algorithms that use CV likelihood instead of the conventional training set likelihood for robust model estimation. Since CV-EM is a parameter optimization method and CV Gaussian mixture optimization is a structure optimization algorithm, these methods can be combined. Large vocabulary speech recognition experiments are performed on oral presentations. It is shown that both CV-EM and CV Gaussian mixture optimization give lower word error rates than the conventional EM, and their combination is effective to further reduce the word error rate.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128665915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech-translation: from domain-limited to domain-unlimited translation tasks
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430141
S. Vogel
Summary form only given. In this paper we will review some of the recent work done in both domain-limited and domain-unlimited speech translation. We will show where progress has been made and highlight areas where initial expectations have not yet been met.
{"title":"Speech-translation: from domain-limited to domain-unlimited translation tasks","authors":"S. Vogel","doi":"10.1109/ASRU.2007.4430141","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430141","url":null,"abstract":"Summary form only given. In this paper we will review some of recent work done in both domain-limited and domain-unlimited speech translation. We will show where progress has been made and highlight areas, where initial expectations have not been met so far.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122420454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixture Gaussian HMM-trajectory method using likelihood compensation
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430127
Yasuhiro Minami
We propose a new speech recognition method (the HMM-trajectory method) that generates a speech trajectory from HMMs by maximizing their likelihood while accounting for the relationship between MFCCs and dynamic MFCCs. One major advantage of this method is that this relationship, ignored in conventional speech recognition, is used directly in the recognition phase. This paper improves the recognition performance of the HMM-trajectory method when dealing with Gaussian mixture distributions. While the HMM-trajectory method chooses the Gaussian distribution sequence of the HMM states by selecting the best Gaussian distribution in each state during Viterbi decoding and calculating the HMM-trajectory likelihood along that sequence, the proposed method compensates the HMM-trajectory likelihood with the ordinary HMM likelihood. In speaker-independent speech recognition experiments, the proposed method reduced the error rate by about 10% on the task compared with conventional HMMs, proving its effectiveness for Gaussian mixture components.
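For context, the constraint underlying trajectory methods can be written as follows (the standard maximum-likelihood parameter-generation formulation; the paper's mixture-selection and likelihood-compensation details go beyond this sketch): the augmented observation vector o of static and dynamic MFCCs is a linear function W of the static sequence c, and the trajectory is the static sequence maximizing the Gaussian likelihood of the chosen state/mixture sequence q.

\[
  \mathbf{o} = W\mathbf{c}, \qquad
  \hat{\mathbf{c}} = \arg\max_{\mathbf{c}}
      \mathcal{N}\!\left(W\mathbf{c};\, \boldsymbol{\mu}_{q},\, \boldsymbol{\Sigma}_{q}\right)
  = \left(W^{\top}\boldsymbol{\Sigma}_{q}^{-1}W\right)^{-1}
        W^{\top}\boldsymbol{\Sigma}_{q}^{-1}\boldsymbol{\mu}_{q}
\]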
{"title":"Mixture Gaussian HMM-trajctory method using likelihood compensation","authors":"Yasuhiro Minami","doi":"10.1109/ASRU.2007.4430127","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430127","url":null,"abstract":"We propose a new speech recognition method (HMM-trajectory method) that generates a speech trajectory from HMMs by maximizing their likelihood while accounting for the relationship between the MFCCs and dynamic MFCCs. One major advantage of this method is that this relationship, ignored in conventional speech recognition, is directly used in the speech recognition phase. This paper improves the recognition performance of the HMM-trajectory method for dealing with mixture Gaussian distributions. While the HMM-trajectory method chooses the Gaussian distribution sequence of the HMM states by selecting the best Gaussian distribution in the state during Viterbi decoding and calculating HMM trajectory likelihood along with the sequence, the proposed method compensates for HMM trajectory likelihood using ordinary HMM likelihood. In speaker-independent speech recognition experiments, the proposed method reduced the error rate about 10% for the task compared with HMMs, proving its effectiveness for Gaussian mixture components.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124363010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncertainty in training large vocabulary speech recognizers
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430160
A. Subramanya, C. Bartels, J. Bilmes, Patrick Nguyen
We propose a technique for annotating data used to train a speech recognizer. The proposed scheme is based on labeling only a single frame for every word in the training set. We make use of the virtual evidence (VE) framework within a graphical model to take advantage of such data. We apply this approach to a large vocabulary speech recognition task and show that our VE-based training scheme improves over the performance of a system trained using sequence-labeled data by 2.8% and 2.1% on the dev01 and eval01 sets, respectively. Annotating data in the proposed scheme is not significantly slower than sequence labeling. We present timing results showing that training with the proposed approach is about 10 times faster than training with sequence-labeled data, while using only about 75% of the memory.
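A minimal sketch of how a single labeled frame can be injected as virtual evidence, assuming a plain HMM forward pass where ve maps a frame index to a per-state soft-evidence weight (near one for states consistent with the label, near zero otherwise); the actual system uses a richer graphical-model formulation, so treat this only as an illustration of the idea.

import numpy as np

def forward_with_virtual_evidence(log_obs, log_trans, log_init, ve):
    # HMM forward recursion where ve[t], if present, multiplies the state
    # probabilities at frame t -- the "virtual evidence" from a labeled frame.
    T, S = log_obs.shape
    alpha = np.empty((T, S))
    alpha[0] = log_init + log_obs[0] + np.log(ve.get(0, np.ones(S)))
    for t in range(1, T):
        for s in range(S):
            alpha[t, s] = np.logaddexp.reduce(alpha[t - 1] + log_trans[:, s]) + log_obs[t, s]
        alpha[t] += np.log(ve.get(t, np.ones(S)))
    return alpha          # log-sum-exp of alpha[-1] gives the evidence-weighted likelihood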
{"title":"Uncertainty in training large vocabulary speech recognizers","authors":"A. Subramanya, C. Bartels, J. Bilmes, Patrick Nguyen","doi":"10.1109/ASRU.2007.4430160","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430160","url":null,"abstract":"We propose a technique for annotating data used to train a speech recognizer. The proposed scheme is based on labeling only a single frame for every word in the training set. We make use of the virtual evidence (VE) framework within a graphical model to take advantage of such data. We apply this approach to a large vocabulary speech recognition task, and show that our VE-based training scheme can improve over the performance of a system trained using sequence labeled data by 2.8% and 2.1% on the dev01 and eva101 sets respectively. Annotating data in the proposed scheme is not significantly slower than sequence labeling. We present timing results showing that training using the proposed approach is about 10 times faster than training using sequence labeled data while using only about 75% of the memory.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121582478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
State-dependent mixture tying with variable codebook size for accented speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430128
L. Yi, Zheng Fang, He Lei, Xia Yunqing
In this paper, we propose state-dependent tied mixture (SDTM) models with a variable codebook size to improve model robustness to accented phonetic variations while maintaining model discriminative ability. State tying and mixture tying are combined to generate SDTM models. Compared to a pure mixture tying system, the SDTM model uses state tying to preserve state identity; compared to a pure state tying system, it uses a small set of parameters and discards overlapping mixture distributions for robust model estimation. The codebook size of an SDTM model varies according to the confusion probability of its states: the more confusable a state is, the larger its codebook, giving a higher degree of model resolution. The codebook size is governed by the state-level variation probability of accented phonetic confusions, which can be extracted automatically by frame-to-state alignment based on local model mismatch. The effectiveness of this approach is evaluated on accented Mandarin speech. Our method yields significant absolute word error rate reductions of 2.1%, 9.5% and 3.5% compared with state tying, mixture tying and state-based phonetic tied mixtures, respectively.
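A toy sketch of the variable-codebook idea, assuming per-state confusion probabilities are already available from frame-to-state alignment; the linear allocation rule and the size limits below are invented for illustration and are not the paper's formula.

import numpy as np

def allocate_codebook_sizes(confusion_prob, min_size=8, max_size=64):
    # More confusable states get larger codebooks (higher model resolution);
    # less confusable states keep a smaller, more robustly estimated codebook.
    p = np.asarray(confusion_prob, dtype=float)
    scaled = (p - p.min()) / (p.max() - p.min() + 1e-9)
    return (min_size + np.round(scaled * (max_size - min_size))).astype(int)

print(allocate_codebook_sizes([0.05, 0.40, 0.90]))   # -> [ 8 31 64]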
{"title":"State-dependent mixture tying with variable codebook size for accented speech recognition","authors":"L. Yi, Zheng Fang, He Lei, Xia Yunqing","doi":"10.1109/ASRU.2007.4430128","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430128","url":null,"abstract":"In this paper, we propose a state-dependent tied mixture (SDTM) models with variable codebook size to improve the model robustness for accented phonetic variations while maintaining model discriminative ability. State tying and mixture tying are combined to generate SDTM models. Compared to a pure mixture tying system, the SDTM model uses state tying to reserve the state identity; compared to the sole state tying system, such model uses a small set of parameters to discard the overlapping mixture distributions for robust model estimation. The codebook size of SDTM model is varied according to the confusion probability of states. The more confusable a state is, the larger its codebook size gets for a higher degree of model resolution. The codebook size is governed by state level variation probability of accented phonetic confusions which can be automatically extracted by frame-to-state alignment based on the local model mismatch. The effectiveness of this approach is evaluated on Mandarin accented speech. Our method yields a significant 2.1%, 9.5% and 3.5% absolute word error rate reduction compared with state tying, mixture tying and state-based phonetic tied mixtures, respectively.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122197992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized linear interpolation of language models
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430098
B. Hsu
Despite the prevalent use of model combination techniques to improve speech recognition performance on domains with limited data, little prior research has focused on the choice of the actual interpolation model. For merging language models, the most popular approach has been simple linear interpolation. In this work, we propose a generalization of linear interpolation that computes context-dependent mixture weights from arbitrary features. Results on a lecture transcription task yield up to a 1.0% absolute improvement in recognition word error rate (WER).
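A minimal sketch contrasting simple linear interpolation with a context-dependent variant in which the mixture weights are a softmax over history features; the feature set and weight matrix here are made-up placeholders, not the paper's parameterization.

import numpy as np

def linear_interp(p_components, lam):
    # Simple linear interpolation: one global weight per component LM.
    return float(np.dot(lam, p_components))

def generalized_interp(p_components, history_features, W):
    # Context-dependent weights: a softmax over features of the n-gram history,
    # so different histories can favour different component LMs.
    logits = W @ history_features
    lam = np.exp(logits - logits.max())
    lam /= lam.sum()
    return float(np.dot(lam, p_components))

p = np.array([0.012, 0.003])                    # two component LMs' probabilities for one word
feats = np.array([1.0, 0.0, 3.0])               # hypothetical history features
W = np.array([[0.2, -0.1, 0.05], [0.0, 0.3, -0.2]])
print(linear_interp(p, [0.5, 0.5]), generalized_interp(p, feats, W))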
{"title":"Generalized linear interpolation of language models","authors":"B. Hsu","doi":"10.1109/ASRU.2007.4430098","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430098","url":null,"abstract":"Despite the prevalent use of model combination techniques to improve speech recognition performance on domains with limited data, little prior research has focused on the choice of the actual interpolation model. For merging language models, the most popular approach has been the simple linear interpolation. In this work, we propose a generalization of linear interpolation that computes context-dependent mixture weights from arbitrary features. Results on a lecture transcription task yield up to a 1.0% absolute improvement in recognition word error rate (WER).","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131065937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}