Latest Publications from the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding

Integrating morphology into automatic speech recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373386
H. Sak, M. Saraçlar, Tunga Güngör
This paper proposes a novel approach to integrating morphology as a model into an automatic speech recognition (ASR) system for morphologically rich languages. High out-of-vocabulary (OOV) word rates have been a major challenge for ASR in morphologically productive languages. The standard approach to this problem has been to shift from words to sub-word units in language modeling, so that the only change to the system is in the language model estimated over these units. In contrast, we propose to integrate the morphology, like any other knowledge source such as the lexicon and the language model, directly into the search network. The morphological parser for a language, implemented as a finite-state lexical transducer, can be considered a computational lexicon. The computational lexicon represents a dynamic vocabulary, in contrast to the static vocabulary generally used for ASR. We compose the transducer for this computational lexicon with a statistical language model over lexical morphemes to obtain a morphology-integrated search network. The resulting search network generates only grammatical word forms and improves recognition accuracy thanks to the reduced OOV rate. We give experimental results for Turkish broadcast news transcription and show that the proposed system outperforms the 50 K and 100 K vocabulary word models, while the 200 K vocabulary word model remains slightly better.
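As an illustration of the composition step, here is a minimal Python sketch in which a plain dictionary stands in for the finite-state lexical transducer and all stems, suffixes, and log-probabilities are invented: composing the morpheme lexicon with a morpheme-level bigram LM yields a network that scores only grammatical word forms.

```python
import math

# Toy morphotactics: stem -> allowed suffix sequences. A dictionary stands in
# for the finite-state lexical transducer of a real morphological parser.
MORPHOTACTICS = {
    "ev":    [[], ["ler"], ["ler", "de"], ["de"]],   # ev, evler, evlerde, evde
    "kitap": [[], ["lar"], ["lar", "da"]],
}

# Toy bigram LM over lexical morphemes: log P(m2 | m1). Values are invented.
BIGRAM = {
    ("<s>", "ev"): -0.7, ("ev", "ler"): -0.4, ("ev", "de"): -1.1,
    ("ler", "de"): -0.9, ("<s>", "kitap"): -1.2, ("kitap", "lar"): -0.5,
    ("lar", "da"): -1.0,
}

def lm_score(morphemes):
    """Log-probability of a morpheme sequence (unseen bigram = -inf)."""
    score, prev = 0.0, "<s>"
    for m in morphemes:
        score += BIGRAM.get((prev, m), -math.inf)
        prev = m
    return score

# "Composition": enumerate only the word forms the lexicon licenses, weighted
# by the morpheme LM. Ungrammatical forms never enter the search network.
network = []
for stem, suffix_seqs in MORPHOTACTICS.items():
    for suffixes in suffix_seqs:
        morphemes = [stem] + suffixes
        network.append(("".join(morphemes), lm_score(morphemes)))

for form, score in sorted(network, key=lambda x: -x[1]):
    print(f"{form:10s} {score:6.2f}")
```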
Cited by: 12
Query-by-example Spoken Term Detection For OOV terms
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373341
Carolina Parada, A. Sethy, B. Ramabhadran
The goal of Spoken Term Detection (STD) technology is to allow open-vocabulary search over large collections of speech content. In this paper, we address cases where the search terms of interest (queries) are acoustic examples, provided either by identifying a region of interest in a speech stream or by speaking the query term. Queries often relate to named entities and foreign words, which typically have poor coverage in the vocabulary of Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Throughout this paper, we focus on query-by-example search for such out-of-vocabulary (OOV) query terms. We build upon a finite state transducer (FST) based search and indexing system [1] to address query-by-example search for OOV terms, representing both the query and the index as phonetic lattices from the output of an LVCSR system. We provide results comparing different representations and generation mechanisms for queries and for indexes built with word units and with combined word and subword units [2]. We also present a two-pass method that augments the STD search results with a query-by-example search seeded by the best hit identified in an initial pass. The results demonstrate that query-by-example search can yield significantly better performance, an Actual Term-Weighted Value (ATWV) of 0.479, compared to a baseline ATWV of 0.325 obtained using reference pronunciations for OOVs. Further improvements can be obtained with the proposed two-pass approach and with filtering using the expected unigram counts from the LVCSR system's lexicon.
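A much-simplified sketch of the matching step, assuming a hypothetical index of 1-best phone strings rather than the FST-indexed phonetic lattices the paper actually uses:

```python
from difflib import SequenceMatcher

# Hypothetical phonetic index: utterance id -> 1-best phone string from ASR.
INDEX = {
    "utt1": "sil b ah r aa k ow b aa m ah sil",
    "utt2": "sil dh ah p r eh z ih d ah n t sil",
}

def phone_match(query_phones, utt_phones):
    """Best local-match score of the query inside an utterance (0..1)."""
    q, u = query_phones.split(), utt_phones.split()
    best = 0.0
    for start in range(len(u) - len(q) + 1):
        window = u[start:start + len(q)]
        best = max(best, SequenceMatcher(None, q, window).ratio())
    return best

# Query-by-example: the spoken query is itself decoded into phones.
query = "b ah r aa k"
hits = sorted(((phone_match(query, p), uid) for uid, p in INDEX.items()),
              reverse=True)
print(hits)   # ranked detections; threshold to trade misses vs. false alarms
```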
Cited by: 91
Iterative decoding: A novel re-scoring framework for confusion networks
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373438
Anoop Deoras, F. Jelinek
Recently there has been a lot of interest in confusion network re-scoring using sophisticated and complex knowledge sources. Traditionally, re-scoring has been carried out with the N-best list method or with dynamic programming over lattices or confusion networks. Although the dynamic programming method is optimal, it allows the incorporation of only Markov knowledge sources. N-best lists, on the other hand, can incorporate sentence-level knowledge sources, but with increasing N the re-scoring becomes computationally very intensive. In this paper, we present an elegant framework for confusion network re-scoring called ‘Iterative Decoding’. In it, integrating multiple complex knowledge sources is not only easier, but re-scoring is also much faster than with the N-best list method. Experiments with language model re-scoring show that, for comparable performance (in terms of word error rate (WER)) between Iterative Decoding and N-best list re-scoring, our method requires 22 times less search effort than the N-best list method.
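A toy illustration of the sweep, with an invented confusion network and bigram LM standing in for the long-span knowledge sources; each pass re-scores only the alternatives in one bin while the rest of the hypothesis stays fixed:

```python
import math

# A confusion network: a list of bins, each bin a list of (word, posterior).
CN = [
    [("the", 0.6), ("a", 0.4)],
    [("cat", 0.5), ("cap", 0.5)],
    [("sat", 0.7), ("set", 0.3)],
]

# Invented bigram LM, a stand-in for the complex knowledge sources.
LM = {("<s>", "the"): -0.5, ("the", "cat"): -0.6, ("cat", "sat"): -0.4,
      ("<s>", "a"): -1.0, ("a", "cat"): -1.1, ("the", "cap"): -1.8,
      ("cap", "sat"): -1.7, ("cat", "set"): -1.9, ("cap", "set"): -2.0}

def score(words):
    """Combined acoustic posterior + LM log-score of one CN path."""
    s, prev = 0.0, "<s>"
    for w, bin_ in zip(words, CN):
        s += math.log(dict(bin_)[w]) + LM.get((prev, w), -10.0)
        prev = w
    return s

# Iterative decoding: start from the CN 1-best, then sweep bin by bin,
# re-scoring only the |bin| hypotheses that differ in that position.
hyp = [max(b, key=lambda x: x[1])[0] for b in CN]
changed = True
while changed:
    changed = False
    for i, bin_ in enumerate(CN):
        best_w = max((w for w, _ in bin_),
                     key=lambda w: score(hyp[:i] + [w] + hyp[i + 1:]))
        if best_w != hyp[i]:
            hyp[i], changed = best_w, True
print(hyp)
```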
Cited by: 16
A segmental CRF approach to large vocabulary continuous speech recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5372916
G. Zweig, Patrick Nguyen
This paper proposes a segmental conditional random field framework for large vocabulary continuous speech recognition. Fundamental to this approach is the use of acoustic detectors as the basic input, and the automatic construction of a versatile set of segment-level features. The detector streams operate at multiple time scales (frame, phone, multi-phone, syllable, or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and are naturally geared to explain long-span phenomena such as formant trajectories, duration, and syllable stress patterns. Generalization to unseen words is possible through the use of decomposable consistency features [1], [2], and our framework allows for joint or separate discriminative training of the acoustic and language models. An initial evaluation of this framework with voice search data from the Bing Mobile (BM) application yields a 2% absolute improvement over an HMM baseline.
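A minimal sketch of the segmental (semi-Markov) decoding such a framework relies on; the detector events and the word-level scoring function below are made up, standing in for trained CRF feature weights:

```python
# Hypothetical frame-level detector events and a made-up word-level scorer;
# a trained SCRF would replace seg_score with weighted segment features
# (formant trajectories, duration, syllable stress, ...).
OBS = ["d1", "d2", "d3", "d4"]
MAX_LEN = 3
WORDS = ["hi", "there"]

def seg_score(word, events):
    """Score one (word, segment) pair at the word level."""
    return len(events) * (0.5 if word == "hi" else 0.4) - 0.3

def decode(obs):
    """Semi-Markov Viterbi: jointly pick segment boundaries and word labels."""
    n = len(obs)
    best = [(-1e9, None)] * (n + 1)
    best[0] = (0.0, None)
    for t in range(1, n + 1):
        for length in range(1, min(MAX_LEN, t) + 1):
            for w in WORDS:
                s = best[t - length][0] + seg_score(w, obs[t - length:t])
                if s > best[t][0]:
                    best[t] = (s, (t - length, w))
    segs, t = [], n          # backtrace the best segmentation/labeling
    while t > 0:
        start, w = best[t][1]
        segs.append((w, obs[start:t]))
        t = start
    return list(reversed(segs))

print(decode(OBS))
```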
Cited by: 135
Hierarchical variational loopy belief propagation for multi-talker speech recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373446
Steven J. Rennie, J. Hershey, P. Olsen
We present a new method for single-channel multi-talker speech recognition that combines loopy belief propagation and variational inference methods to control the complexity of inference. The method models each source using an HMM with a hierarchical set of acoustic states, and uses the max model to approximate how the sources interact to generate the mixed data. Inference involves inferring a set of probabilistic time-frequency masks to separate the speakers. By conditioning these masks on the hierarchical acoustic states of the speakers, the fidelity and complexity of acoustic inference can be precisely controlled. Acoustic inference using the algorithm scales linearly with the number of probabilistic time-frequency masks, and temporal inference scales linearly with LM size. Results on the monaural speech separation task (SSC) data demonstrate that the presented Hierarchical Variational Max-Sum Product Algorithm (HVMSP) outperforms VMSP by over 2% absolute while using 4 times fewer probabilistic masks. HVMSP furthermore performs on par with the MSP algorithm, which utilizes exact conditional marginal likelihoods, while using 256 times fewer time-frequency masks.
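The max-model interaction at the heart of the method can be sketched in a few lines; here random arrays stand in for the HMM-based log-spectrum estimates, so the probabilistic masks reduce to a hard argmax (everything below is invented for illustration):

```python
import numpy as np

# Random arrays stand in for model-based log-spectrum estimates of each talker.
rng = np.random.default_rng(0)
T, F = 4, 6                            # toy time-frequency grid
src_a = rng.normal(size=(T, F))        # talker A log-spectrum estimate
src_b = rng.normal(size=(T, F))        # talker B log-spectrum estimate

# Max model: in the log-spectral domain the mixture is approximated by the
# elementwise max of the sources.
mixture = np.maximum(src_a, src_b)

# With point estimates, the probabilistic time-frequency masks collapse to a
# hard per-cell argmax: each cell is explained by the dominant talker.
mask_a = src_a >= src_b
recon_a = np.where(mask_a, mixture, mixture.min())   # crude talker-A estimate
print(mask_a.astype(int))
```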
Cited by: 16
Automatic translation from parallel speech: Simultaneous interpretation as MT training data
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5372880
M. Paulik, A. Waibel
State-of-the-art statistical machine translation depends heavily on the availability of domain-specific bilingual parallel text. However, acquiring large amounts of bilingual parallel text is costly and, depending on the language pair, sometimes impossible. We propose an alternative to parallel text as machine translation (MT) training data: audio recordings of parallel speech (pSp), as produced in any scenario where interpreters are involved. Although interpretation (pSp) differs significantly from translation (parallel text), we achieve surprisingly strong translation results with our pSp-trained MT and speech translation systems. We argue that the presented approach is of special interest for developing speech translation for resource-deficient languages, where even monolingual resources are scarce.
Cited by: 28
Lexicon adaptation for subword speech recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373296
Timo Mertens, Daniel Schneider, A. Næss, T. Svendsen
In this paper we present two approaches to adapting a syllable-based recognition lexicon in an automatic speech recognition (ASR) setting. The motivation is to evaluate whether adaptation techniques commonly used at the word level can also be employed at the subword level. The first method predicts syllable variations, taking into account sub-syllabic phone cluster variations, and subsequently adapts the syllable lexicon. The second approach adds syllable bigrams to the lexicon to cope with the acoustic confusability of subword units and with syllable-inherent phone-attachment ambiguities. We evaluate the methods on two German data sets, one consisting of planned speech and the other of spontaneous speech. Although the first method did not yield any improvement in the syllable error rate (SER), we observed that the predicted confusions correlate with those found in the test data. Bigram adaptation improved the SER by 1.3% and 0.8% absolute on the planned and spontaneous data sets, respectively.
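A minimal sketch of the second approach, with an invented syllable corpus and count threshold: frequent syllable bigrams are promoted to single recognition units in the lexicon.

```python
from collections import Counter

# Invented syllabified corpus; real systems would use held-out adaptation data.
corpus = [["ge", "ben"], ["ge", "ben"], ["an", "ge", "ben"], ["le", "sen"]]

lexicon = {s for utt in corpus for s in utt}            # base syllable units
bigram_counts = Counter(b for utt in corpus
                        for b in zip(utt, utt[1:]))

MIN_COUNT = 2   # hypothetical frequency threshold
for (s1, s2), c in bigram_counts.items():
    if c >= MIN_COUNT:
        lexicon.add(s1 + "_" + s2)      # e.g. "ge_ben" becomes one unit

print(sorted(lexicon))
```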
Cited by: 2
Graph-based submodular selection for extractive summarization
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373486
Hui-Ching Lin, J. Bilmes, Shasha Xie
We propose a novel approach for unsupervised extractive summarization. Our approach builds a semantic graph for the document to be summarized. Summary extraction is then formulated as optimizing submodular functions defined on the semantic graph. The optimization is theoretically guaranteed to be near-optimal under the framework of submodularity. Extensive experiments on the ICSI meeting summarization task on both human transcripts and automatic speech recognition (ASR) outputs show that the graph-based submodular selection approach consistently outperforms the maximum marginal relevance (MMR) approach, a concept-based approach using integer linear programming (ILP), and a recursive graph-based ranking algorithm using Google's PageRank.
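The greedy selection underlying this family of methods can be sketched compactly; the sentence-to-concept sets below are invented, and the paper's actual objective is defined on a semantic graph rather than on bags of concepts:

```python
# Hypothetical sentences mapped to the "concepts" they cover.
SENTENCES = {
    "s1": {"budget", "cuts", "school"},
    "s2": {"budget", "vote"},
    "s3": {"school", "teachers", "strike"},
    "s4": {"weather"},
}

def coverage(selected):
    """Distinct concepts covered: monotone and submodular in `selected`."""
    return len(set().union(*(SENTENCES[s] for s in selected)))

def greedy_summary(budget=2):
    """Greedy maximization; within (1 - 1/e) of optimal for monotone
    submodular objectives under a cardinality constraint."""
    selected = []
    while len(selected) < budget:
        gains = {s: coverage(selected + [s]) - coverage(selected)
                 for s in SENTENCES if s not in selected}
        best = max(gains, key=gains.get)
        if gains[best] == 0:      # no sentence adds new concepts
            break
        selected.append(best)
    return selected

print(greedy_summary())           # -> ['s1', 's3']
```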
Cited by: 88
An exploration of large vocabulary tools for small vocabulary phonetic recognition
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373263
Tara N. Sainath, B. Ramabhadran, M. Picheny
While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state-of-the-art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements through error analysis is very difficult. Second, LVCSR systems often take significantly longer to train and test new research ideas compared to small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich and hand-labeled corpus and offers a good test bed for studying algorithmic improvements. However, research ideas explored on small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we address these issues by taking the standard "recipe" used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark for comparing methods. We find that at the speaker-independent (SI) level, our results offer performance comparable to other SI HMM systems. By taking advantage of the speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best result reported on the TIMIT task to date, moving us closer to the human phonetic recognition error rate of 15%. We propose the use of this system as a baseline for future research and believe that it will serve as a good framework for exploring ideas that will carry over to LVCSR systems.
Cited by: 41
Phone-to-word decoding through statistical machine translation and complementary system combination
Pub Date: 2009-12-01 DOI: 10.1109/ASRU.2009.5373281
D. Falavigna, M. Gerosa, R. Gretter, D. Giuliani
In this paper, phone-to-word transduction is first investigated by coupling a speech recognizer, which generates a phone sequence or a phone confusion network for each speech segment, with the efficient confusion-network decoder adopted by MOSES, a popular statistical machine translation toolkit. Then, system combination is investigated by combining the outputs of several conventional ASR systems with the output of a system that embeds phone-to-word decoding through statistical machine translation. Experiments are carried out on a large vocabulary speech recognition task consisting of the transcription of speeches delivered in English during the European Parliament Plenary Sessions (EPPS). While only a marginal performance improvement is achieved in the system combination experiments when the output of the phone-to-word transducer is included in the combination, partial results show great potential for improvement.
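A simplified noisy-channel sketch of phone-to-word transduction, using an invented pronunciation dictionary and unigram LM in place of MOSES's phrase tables and confusion-network input:

```python
import math

# Invented pronunciation dictionary and unigram word LM (log-probs).
PRON = {"the": "dh ah", "a": "ah", "session": "s eh sh ah n"}
LOGP = {"the": -1.0, "a": -1.5, "session": -3.0}

def decode(phones):
    """Segment a phone string into dictionary words by dynamic programming."""
    ph = phones.split()
    n = len(ph)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for t in range(1, n + 1):
        for w, pron in PRON.items():
            p = pron.split()
            if t >= len(p) and ph[t - len(p):t] == p:
                s = best[t - len(p)][0] + LOGP[w]
                if s > best[t][0]:
                    best[t] = (s, (t - len(p), w))
    words, t = [], n          # backtrace the best word sequence
    while t > 0 and best[t][1]:
        start, w = best[t][1]
        words.append(w)
        t = start
    return list(reversed(words))

print(decode("dh ah s eh sh ah n"))   # -> ['the', 'session']
```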
Cited by: 4