Integrating morphology into automatic speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373386
H. Sak, M. Saraçlar, Tunga Güngör
This paper proposes a novel approach to integrating morphology as a model into an automatic speech recognition (ASR) system for morphologically rich languages. High out-of-vocabulary (OOV) word rates have been a major challenge for ASR in morphologically productive languages. The standard approach to this problem has been to shift from words to sub-word units in language modeling, so that the only change to the system is in the language model estimated over these units. In contrast, we propose to integrate morphology, like any other knowledge source - such as the lexicon and the language model - directly into the search network. The morphological parser for a language, implemented as a finite-state lexical transducer, can be considered a computational lexicon. This computational lexicon represents a dynamic vocabulary, in contrast to the static vocabulary generally used for ASR. We compose the transducer for this computational lexicon with a statistical language model over lexical morphemes to obtain a morphology-integrated search network. The resulting search network generates only grammatical word forms and improves recognition accuracy thanks to the reduced OOV rate. We give experimental results for Turkish broadcast news transcription and show that the proposed model outperforms the 50K and 100K vocabulary word models, while the 200K vocabulary word model remains slightly better.
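The "dynamic vocabulary" property is easy to sketch: a tiny lexical transducer whose states encode Turkish-like morphotactics enumerates only grammatical word forms. The stems, suffixes, and morpheme labels below are invented for illustration, not the paper's actual lexicon, and a real system would compose an equivalent FST with a morpheme-level language model rather than enumerating.

```python
# Minimal sketch of a "computational lexicon": a toy finite-state lexical
# transducer over hypothetical Turkish-like stems and suffixes.
# States: 0 = start, 1 = after stem, 2 = after number, 3 = final (case marked).
# Each arc maps a surface piece to (next_state, lexical morpheme label).
ARCS = {
    0: {"ev": (1, "ev+Noun"), "kitap": (1, "kitap+Noun")},
    1: {"ler": (2, "+A3pl"), "": (2, "+A3sg")},
    2: {"de": (3, "+Loc"), "i": (3, "+Acc"), "": (3, "+Nom")},
}
FINAL = {3}

def expand(state=0, surface="", analysis=()):
    """Enumerate all grammatical word forms the transducer accepts."""
    if state in FINAL:
        yield surface, "".join(analysis)
    for piece, (nxt, label) in ARCS.get(state, {}).items():
        yield from expand(nxt, surface + piece, analysis + (label,))

for word, morphemes in expand():
    print(f"{word:10s} -> {morphemes}")
```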
{"title":"Integrating morphology into automatic speech recognition","authors":"H. Sak, M. Saraçlar, Tunga Güngör","doi":"10.1109/ASRU.2009.5373386","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373386","url":null,"abstract":"This paper proposes a novel approach to integrate the morphology as a model into an automatic speech recognition (ASR) system for morphologically rich languages. The high out-of-vocabulary (OOV) word rates have been a major challenge for ASR in morphologically productive languages. The standard approach to this problem has been to shift from words to sub-word units in language modeling, and the only change to the system is in the language model estimated over these units. In contrast, we propose to integrate the morphology as other any knowledge source - such as the lexicon, and the language model- directly into the search network. The morphological parser for a language, implemented as a finite-state lexical transducer, can be considered as a computational lexicon. The computational lexicon represents a dynamic vocabulary in contrast to a static vocabulary generally used for ASR. We compose the transducer for this computational lexicon with a statistical language model over lexical morphemes to obtain a morphology-integrated search network. The resulting search network generates only grammatical word forms and improves the recognition accuracy due to reduced OOV rate. We give experimental results for Turkish broadcast news transcription, and show that it outperforms the 50 K and 100 K vocabulary word models while the 200 K vocabulary word model is slightly better.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127597726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query-by-example Spoken Term Detection For OOV terms
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373341
Carolina Parada, A. Sethy, B. Ramabhadran
The goal of Spoken Term Detection (STD) technology is to allow open-vocabulary search over large collections of speech content. In this paper, we address cases where the search terms of interest (queries) are given as acoustic examples, provided either by identifying a region of interest in a speech stream or by speaking the query term. Queries often relate to named entities and foreign words, which typically have poor coverage in the vocabulary of Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Throughout this paper, we focus on query-by-example search for such out-of-vocabulary (OOV) query terms. We build upon a finite-state transducer (FST) based search and indexing system [1] to address query-by-example search for OOV terms by representing both the query and the index as phonetic lattices derived from the output of an LVCSR system. We provide results comparing different representations and generation mechanisms for both queries and indexes, built with word units and with combined word and subword units [2]. We also present a two-pass method in which query-by-example search with the best hit identified in an initial pass is used to augment the STD search results. The results demonstrate that query-by-example search can yield significantly better performance, measured using Actual Term-Weighted Value (ATWV): 0.479 compared to a baseline ATWV of 0.325 obtained with reference pronunciations for OOVs. Further improvements can be obtained with the proposed two-pass approach and by filtering with the expected unigram counts from the LVCSR system's lexicon.
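A much-simplified picture of the search problem: instead of the paper's FST lattice index, the toy sketch below indexes phone n-grams from each utterance's decoded phone string and scores a spoken query by the n-grams its own decodings share with each utterance. All phone strings, utterance ids, and query hypotheses are made up.

```python
from collections import defaultdict

def ngrams(phones, n=3):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def build_index(utterances, n=3):
    """Inverted index from phone n-grams to the utterances containing them."""
    index = defaultdict(set)
    for utt_id, phones in utterances.items():
        for g in ngrams(phones, n):
            index[g].add(utt_id)
    return index

def query_by_example(query_hyps, index, n=3):
    """Score utterances by how many query n-grams (from any hypothesis) they share."""
    scores = defaultdict(int)
    for phones in query_hyps:
        for g in ngrams(phones, n):
            for utt_id in index[g]:
                scores[utt_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical decoded phone strings for two archive utterances.
utterances = {
    "utt1": "sil b r aa k ow b aa m ah sil".split(),
    "utt2": "sil dh ah p r eh z ih d ah n t sil".split(),
}
# Two acoustic decodings of a spoken OOV query (hypothetical).
query_hyps = ["ow b aa m ah".split(), "ax b aa m aa".split()]
print(query_by_example(query_hyps, build_index(utterances)))
```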
{"title":"Query-by-example Spoken Term Detection For OOV terms","authors":"Carolina Parada, A. Sethy, B. Ramabhadran","doi":"10.1109/ASRU.2009.5373341","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373341","url":null,"abstract":"The goal of Spoken Term Detection (STD) technology is to allow open vocabulary search over large collections of speech content. In this paper, we address cases where search term(s) of interest (queries) are acoustic examples. This is provided either by identifying a region of interest in a speech stream or by speaking the query term. Queries often relate to named-entities and foreign words, which typically have poor coverage in the vocabulary of Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Throughout this paper, we focus on query-by-example search for such out-of-vocabulary (OOV) query terms. We build upon a finite state transducer (FST) based search and indexing system [1] to address the query by example search for OOV terms by representing both the query and the index as phonetic lattices from the output of an LVCSR system. We provide results comparing different representations and generation mechanisms for both queries and indexes built with word and combined word and subword units [2]. We also present a two-pass method which uses query-by-example search using the best hit identified in an initial pass to augment the STD search results. The results demonstrate that query-by-example search can yield a significantly better performance, measured using Actual Term-Weighted Value (ATWV), of 0.479 when compared to a baseline ATWV of 0.325 that uses reference pronunciations for OOVs. Further improvements can be obtained with the proposed two pass approach and filtering using the expected unigram counts from the LVCSR system's lexicon.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125962269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterative decoding: A novel re-scoring framework for confusion networks
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373438
Anoop Deoras, F. Jelinek
Recently there has been a lot of interest in confusion network re-scoring using sophisticated and complex knowledge sources. Traditionally, re-scoring has been carried out either with N-best lists or with dynamic programming over lattices or confusion networks. Although the dynamic programming method is optimal, it only allows the incorporation of Markovian knowledge sources. N-best lists, on the other hand, can incorporate sentence-level knowledge sources, but with increasing N the re-scoring becomes computationally very intensive. In this paper, we present an elegant framework for confusion network re-scoring called ‘Iterative Decoding’. It not only makes the integration of multiple, complex knowledge sources easier, but also allows much faster re-scoring than the N-best list method. Experiments with language model re-scoring show that, for comparable performance in terms of word error rate (WER), the search effort required by Iterative Decoding is 22 times less than that of the N-best list method.
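The gist can be illustrated in a few lines: sweep over the confusion-network bins and, in each bin, pick the word that maximizes a sentence-level score while all other bins are held fixed, repeating until nothing changes. The confusion network and the toy scoring function below are hypothetical stand-ins for real ASR output and a long-span language model.

```python
def iterative_decode(confusion_net, sentence_score, max_sweeps=10):
    """Coordinate-ascent re-scoring over the bins of a confusion network."""
    hyp = [bin_[0] for bin_ in confusion_net]          # start from the 1-best in each bin
    for _ in range(max_sweeps):
        changed = False
        for i, bin_ in enumerate(confusion_net):
            best = max(bin_, key=lambda w: sentence_score(hyp[:i] + [w] + hyp[i + 1:]))
            if best != hyp[i]:
                hyp[i], changed = best, True
        if not changed:                                 # converged
            break
    return hyp

# Toy sentence-level scorer: prefers a particular word sequence (stand-in for an LM).
target = "the cat sat on the mat".split()
def sentence_score(words):
    return sum(w == t for w, t in zip(words, target))

cn = [["a", "the"], ["cat", "cap"], ["sat", "set"], ["on"], ["the", "a"], ["mat", "map"]]
print(" ".join(iterative_decode(cn, sentence_score)))
```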
{"title":"Iterative decoding: A novel re-scoring framework for confusion networks","authors":"Anoop Deoras, F. Jelinek","doi":"10.1109/ASRU.2009.5373438","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373438","url":null,"abstract":"Recently there has been a lot of interest in confusion network re-scoring using sophisticated and complex knowledge sources. Traditionally, re-scoring has been carried out by the N-best list method or by the lattices or confusion network dynamic programming method. Although the dynamic programming method is optimal, it allows for the incorporation of only Markov knowledge sources. N-best lists, on the other hand, can incorporate sentence level knowledge sources, but with increasing N, the re-scoring becomes computationally very intensive. In this paper, we present an elegant framework for confusion network re-scoring called ‘Iterative Decoding’. In it, integration of multiple and complex knowledge sources is not only easier but it also allows for much faster re-scoring as compared to the N-best list method. Experiments with Language Model re-scoring show that for comparable performance (in terms of word error rate (WER)) of Iterative Decoding and N-best list re-scoring, the search effort required by our method is 22 times less than that of the N-best list method.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126856372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A segmental CRF approach to large vocabulary continuous speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372916
G. Zweig, Patrick Nguyen
This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline.
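To make the word-level scoring concrete, here is a toy stand-in: a hypothesis is a segmentation of a phone-detector stream plus one word per segment, scored by a linear model over segment/word features. The detector stream, vocabulary, features, and weights are all invented, and exhaustive enumeration replaces the dynamic programming and discriminative training a real segmental CRF system would use.

```python
import itertools

DETECTIONS = ["d", "ah", "g", "k", "ae", "t"]          # toy phone-detector outputs
VOCAB = {"dog": ["d", "ao", "g"], "cat": ["k", "ae", "t"], "dug": ["d", "ah", "g"]}

def features(segment, word):
    """Word-level features comparing a detector-stream segment with a word's phones."""
    ref = VOCAB[word]
    overlap = len(set(segment) & set(ref))
    return {"phone_overlap": overlap, "len_diff": -abs(len(segment) - len(ref))}

WEIGHTS = {"phone_overlap": 1.0, "len_diff": 0.5}      # would be learned discriminatively

def score(segments, words):
    return sum(sum(WEIGHTS[k] * v for k, v in features(seg, w).items())
               for seg, w in zip(segments, words))

def decode(stream):
    """Exhaustively search segmentations and word labelings for the best score."""
    best_score, best_hyp = float("-inf"), None
    n = len(stream)
    for k in range(1, n + 1):                          # number of segments
        for cuts in itertools.combinations(range(1, n), k - 1):
            bounds = [0, *cuts, n]
            segs = [stream[a:b] for a, b in zip(bounds, bounds[1:])]
            for words in itertools.product(VOCAB, repeat=k):
                s = score(segs, words)
                if s > best_score:
                    best_score, best_hyp = s, (segs, list(words))
    return best_score, best_hyp

print(decode(DETECTIONS))
```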
{"title":"A segmental CRF approach to large vocabulary continuous speech recognition","authors":"G. Zweig, Patrick Nguyen","doi":"10.1109/ASRU.2009.5372916","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372916","url":null,"abstract":"This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for the joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application results in a 2% absolute improvement over an HMM baseline.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126866345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical variational loopy belief propagation for multi-talker speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373446
Steven J. Rennie, J. Hershey, P. Olsen
We present a new single-channel method for multi-talker speech recognition that combines loopy belief propagation and variational inference to control the complexity of inference. The method models each source using an HMM with a hierarchical set of acoustic states, and uses the max model to approximate how the sources interact to generate the mixed data. Inference involves inferring a set of probabilistic time-frequency masks to separate the speakers. By conditioning these masks on the hierarchical acoustic states of the speakers, the fidelity and complexity of acoustic inference can be precisely controlled. Acoustic inference using the algorithm scales linearly with the number of probabilistic time-frequency masks, and temporal inference scales linearly with LM size. Results on the monaural Speech Separation Challenge (SSC) data demonstrate that the presented Hierarchical Variational Max-Sum Product Algorithm (HVMSP) outperforms VMSP by over 2% absolute while using 4 times fewer probabilistic masks. HVMSP furthermore performs on par with the MSP algorithm, which uses exact conditional marginal likelihoods, while using 256 times fewer time-frequency masks.
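One building block that is easy to show in isolation is the max model with time-frequency masks: in the log-spectral domain, a two-talker mixture is approximately the element-wise max of the sources, so a mask assigning each time-frequency cell to the dominant source separates them. The spectra below are random placeholders, and the hard binary mask stands in for the probabilistic, hierarchically conditioned masks the paper actually infers.

```python
import numpy as np

rng = np.random.default_rng(0)
log_spec_a = rng.normal(size=(100, 64))      # talker A log-spectrum, frames x bins (toy)
log_spec_b = rng.normal(size=(100, 64))      # talker B log-spectrum (toy)

mixture = np.maximum(log_spec_a, log_spec_b)            # max-model approximation of the mix
mask_a = log_spec_a >= log_spec_b                       # binary stand-in for a probabilistic mask

recovered_a = np.where(mask_a, mixture, mixture.min())  # keep A-dominant cells, floor the rest
print("fraction of cells assigned to talker A:", mask_a.mean())
```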
{"title":"Hierarchical variational loopy belief propagation for multi-talker speech recognition","authors":"Steven J. Rennie, J. Hershey, P. Olsen","doi":"10.1109/ASRU.2009.5373446","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373446","url":null,"abstract":"We present a new method for multi-talker speech recognition using a single-channel that combines loopy belief propagation and variational inference methods to control the complexity of inference. The method models each source using an HMM with a hierarchical set of acoustic states, and uses the max model to approximate how the sources interact to generate mixed data. Inference involves inferring a set of probabilistic time-frequency masks to separate the speakers. By conditioning these masks on the hierarchical acoustic states of the speakers, the fidelity and complexity of acoustic inference can be precisely controlled. Acoustic inference using the algorithm scales linearly with the number of probabilistic time-frequency masks, and temporal inference scales linearly with LM size. Results on the monaural speech separation task (SSC) data demonstrate that the presented Hierarchical Variational Max-Sum Product Algorithm (HVMSP) outperforms VMSP by over 2% absolute using 4 times fewer probablistic masks. HVMSP furthermore performs on-par with the MSP algorithm, which utilizes exact conditional marginal likelihoods, using 256 times less time-frequency masks.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131359655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic translation from parallel speech: Simultaneous interpretation as MT training data
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5372880
M. Paulik, A. Waibel
State-of-the-art statistical machine translation depends heavily on the availability of domain-specific bilingual parallel text. However, acquiring large amounts of bilingual parallel text is costly and, depending on the language pair, sometimes impossible. We propose an alternative to parallel text as machine translation (MT) training data: audio recordings of parallel speech (pSp), as it occurs in any scenario where interpreters are involved. Although interpretation (pSp) differs significantly from translation (parallel text), we achieve surprisingly strong translation results with our pSp-trained MT and speech translation systems. We argue that the presented approach is of special interest for developing speech translation for resource-deficient languages, where even monolingual resources are scarce.
{"title":"Automatic translation from parallel speech: Simultaneous interpretation as MT training data","authors":"M. Paulik, A. Waibel","doi":"10.1109/ASRU.2009.5372880","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372880","url":null,"abstract":"State-of-the art statistical machine translation depends heavily on the availability of domain-specific bilingual parallel text. However, acquiring large amounts of bilingual parallel text is costly and, depending on the language pair, sometimes impossible. We propose an alternative to parallel text as machine translation (MT) training data; audio recordings of parallel speech (pSp) as it occurs in any scenario where interpreters are involved. Although interpretation (pSp) differs significantly from translation (parallel text), we achieve surprisingly strong translation results with our pSp-trained MT and speech translation systems.We argue that the presented approach is of special interest for developing speech translation in the context of resource-deficient languages where even monolingual resources are scarce.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"10 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114023514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lexicon adaptation for subword speech recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373296
Timo Mertens, Daniel Schneider, A. Næss, T. Svendsen
In this paper, we present two approaches to adapting a syllable-based recognition lexicon in an automatic speech recognition (ASR) setting. The motivation is to evaluate whether adaptation techniques commonly used at the word level can also be employed at the subword level. The first method predicts syllable variations, taking sub-syllabic phone cluster variations into account, and subsequently adapts the syllable lexicon. The second approach adds syllable bigrams to the lexicon to cope with the acoustic confusability of subword units and syllable-inherent phone attachment ambiguities. We evaluate the methods on two German data sets, one consisting of planned and the other of spontaneous speech. Although the first method did not yield any improvement in syllable error rate (SER), we observed that the predicted confusions correlate with those found in the test data. Bigram adaptation improved the SER by 1.3% and 0.8% absolute on the planned and spontaneous data sets, respectively.
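The second adaptation method lends itself to a compact sketch: count syllable bigrams in a syllabified corpus and add the frequent ones to the recognition lexicon as merged units. The syllabified utterances, count threshold, and merging convention below are invented; the paper's selection of bigrams is more involved.

```python
from collections import Counter

# Toy syllabified German-like utterances (hypothetical).
corpus = [
    ["ge", "ben", "wir", "ha", "ben", "ge", "sagt"],
    ["wir", "ha", "ben", "das", "ge", "sagt"],
]

def add_bigram_units(lexicon, corpus, min_count=2):
    """Add frequent syllable bigrams to the lexicon as new merged recognition units."""
    counts = Counter((a, b) for utt in corpus for a, b in zip(utt, utt[1:]))
    for (a, b), c in counts.items():
        if c >= min_count:
            lexicon.add(a + "_" + b)          # merged unit, e.g. "ha_ben"
    return lexicon

lexicon = {syl for utt in corpus for syl in utt}
print(sorted(add_bigram_units(lexicon, corpus)))
```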
{"title":"Lexicon adaptation for subword speech recognition","authors":"Timo Mertens, Daniel Schneider, A. Næss, T. Svendsen","doi":"10.1109/ASRU.2009.5373296","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373296","url":null,"abstract":"In this paper we present two approaches to adapt a syllable-based recognition lexicon in an automatic speech recognition (ASR) setting. The motivation is to evaluate whether adaptation techniques commonly used on a word level can also be employed on a subword level. The first method predicts syllable variations, taking into account sub-syllabic phone cluster variations, and subsequently adapts the syllable lexicon. The second approach adds syllable bigrams to the lexicon to cope with acoustic confusability of subword units and syllable-inherent phone attachment ambiguities. We evaluate the methods on two German data sets, one consisting of planned and the other of spontaneous speech. Although the first method did not yield any improvement in the syllable error rate (SER), we could observe that the predicted confusions correlate with those observed in the test data. Bigram adaptation improved the SER by 1.3% and 0.8% absolute on the planned and spontaneous data sets, respectively.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114946830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based submodular selection for extractive summarization
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373486
Hui-Ching Lin, J. Bilmes, Shasha Xie
We propose a novel approach for unsupervised extractive summarization. Our approach builds a semantic graph for the document to be summarized. Summary extraction is then formulated as optimizing submodular functions defined on the semantic graph. The optimization is theoretically guaranteed to be near-optimal under the framework of submodularity. Extensive experiments on the ICSI meeting summarization task on both human transcripts and automatic speech recognition (ASR) outputs show that the graph-based submodular selection approach consistently outperforms the maximum marginal relevance (MMR) approach, a concept-based approach using integer linear programming (ILP), and a recursive graph-based ranking algorithm using Google's PageRank.
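The selection step can be illustrated with a facility-location-style coverage objective, which is monotone submodular, maximized by the standard greedy algorithm that carries a (1 - 1/e) approximation guarantee under a cardinality budget. The similarity matrix and budget below are toy stand-ins for the paper's semantic graph and its specific submodular functions.

```python
def coverage(selected, sim):
    """Monotone submodular objective: how well the selected set covers every sentence."""
    return sum(max((sim[i][j] for j in selected), default=0.0) for i in range(len(sim)))

def greedy_summarize(sim, budget):
    """Greedily add the sentence with the largest marginal gain until the budget is spent."""
    selected = []
    while len(selected) < budget:
        remaining = [j for j in range(len(sim)) if j not in selected]
        best = max(remaining, key=lambda j: coverage(selected + [j], sim))
        selected.append(best)
    return selected

# Toy pairwise sentence similarities (symmetric, self-similarity = 1).
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
print(greedy_summarize(sim, budget=2))   # picks one sentence from each similarity cluster
```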
{"title":"Graph-based submodular selection for extractive summarization","authors":"Hui-Ching Lin, J. Bilmes, Shasha Xie","doi":"10.1109/ASRU.2009.5373486","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373486","url":null,"abstract":"We propose a novel approach for unsupervised extractive summarization. Our approach builds a semantic graph for the document to be summarized. Summary extraction is then formulated as optimizing submodular functions defined on the semantic graph. The optimization is theoretically guaranteed to be near-optimal under the framework of submodularity. Extensive experiments on the ICSI meeting summarization task on both human transcripts and automatic speech recognition (ASR) outputs show that the graph-based submodular selection approach consistently outperforms the maximum marginal relevance (MMR) approach, a concept-based approach using integer linear programming (ILP), and a recursive graph-based ranking algorithm using Google's PageRank.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133618651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An exploration of large vocabulary tools for small vocabulary phonetic recognition
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373263
Tara N. Sainath, B. Ramabhadran, M. Picheny
While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state-of-the-art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements based on error analysis is very difficult. Second, LVCSR systems often take a significantly longer time to train and test new research ideas compared to small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich, hand-labeled corpus and offers a good test bed to study algorithmic improvements. However, research ideas explored on small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we address these issues by taking the standard "recipe" used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark for comparing methods. We find that at the speaker-independent (SI) level, our results offer performance comparable to other SI HMM systems. By taking advantage of the speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best result reported on the TIMIT task to date, moving us closer to the reported human phonetic recognition error rate of 15%. We propose the use of this system as a baseline for future research and believe that it will serve as a good framework for exploring ideas that will carry over to LVCSR systems.
{"title":"An exploration of large vocabulary tools for small vocabulary phonetic recognition","authors":"Tara N. Sainath, B. Ramabhadran, M. Picheny","doi":"10.1109/ASRU.2009.5373263","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373263","url":null,"abstract":"While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state of the art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements based on error analysis is very difficult. Second, LVCSR systems often take a significantly longer time to train and test new research ideas compared to small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich and hand-labeled corpus and offers a good test bed to study algorithmic improvements. However, oftentimes research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we address these issues by taking the standard Ȝrecipeȝ used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark to compare methods. We find that at the speaker-independent (SI) level, our results offer comparable performance to other SI HMM systems. By taking advantage of speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best results reported on the TIMIT task to date, moving us closer to the human reported phonetic recognition error rate of 15%. We propose the use of this system as the baseline for future research and believe that it will serve as a good framework to explore ideas that will carry over to LVCSR systems.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134321923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phone-to-word decoding through statistical machine translation and complementary system combination
Pub Date: 2009-12-01 | DOI: 10.1109/ASRU.2009.5373281
D. Falavigna, M. Gerosa, R. Gretter, D. Giuliani
In this paper, phone-to-word transduction is first investigated by coupling a speech recognizer, which generates a phone sequence or a phone confusion network for each speech segment, with the efficient confusion-network decoder adopted by MOSES, a popular statistical machine translation toolkit. Then, system combination is investigated by combining the outputs of several conventional ASR systems with the output of a system that embeds phone-to-word decoding through statistical machine translation. Experiments are carried out on a large vocabulary speech recognition task consisting of the transcription of speeches delivered in English during the European Parliament Plenary Sessions (EPPS). While only a marginal performance improvement is achieved in the system combination experiments when the output of the phone-to-word transducer is included, partial results show great potential for further improvement.
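Stripped of the SMT machinery, phone-to-word transduction over a 1-best phone string reduces to a segmentation search against a pronunciation lexicon, as in the toy dynamic program below. The lexicon, unigram probabilities, and input are hypothetical; the paper works with phone confusion networks and the Moses decoder rather than this simplification.

```python
import math

# Hypothetical pronunciation lexicon and unigram word scores (stand-in for the LM).
LEXICON = {("dh", "ah"): "the", ("k", "ae", "t"): "cat", ("s", "ae", "t"): "sat"}
LOGPROB = {"the": math.log(0.5), "cat": math.log(0.3), "sat": math.log(0.2)}

def decode(phones):
    """Best segmentation of a phone string into lexicon words (Viterbi over prefixes)."""
    n = len(phones)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i][1] is None:                      # prefix not reachable
            continue
        for pron, word in LEXICON.items():
            j = i + len(pron)
            if tuple(phones[i:j]) == pron:
                score = best[i][0] + LOGPROB[word]
                if score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
    return best[n][1]

print(decode("dh ah k ae t s ae t".split()))   # -> ['the', 'cat', 'sat']
```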
{"title":"Phone-to-word decoding through statistical machine translation and complementary system combination","authors":"D. Falavigna, M. Gerosa, R. Gretter, D. Giuliani","doi":"10.1109/ASRU.2009.5373281","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373281","url":null,"abstract":"In this paper, phone-to-word transduction is first investigated by coupling a speech recognizer, generating for each speech segment a phone sequence or a phone confusion network, with the efficient decoder of confusion networks adopted by MOSES, a popular statistical machine translation toolkit. Then, system combination is investigated by combining the outputs of several conventional ASR systems with the output of a system embedding phone-to-word decoding through statistical machine translation. Experiments are carried out in the context of a large vocabulary speech recognition task consisting of transcription of speeches delivered in English during the European Parliament Plenary Sessions (EPPS). While only a marginal performance improvements is achieved in system combination experiments when the output of the phone-to-word transducer is included in the combination, partial results show a great potential for improvements.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132396823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}