
2009 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Detection of OOV words by combining acoustic confidence measures with linguistic features
Pub Date : 2009-12-13 DOI: 10.1109/ASRU.2009.5372877
F. Stouten, D. Fohr, I. Illina
This paper describes the design of an Out-Of-Vocabulary (OOV) word detector. Such a system is assumed to detect segments that correspond to OOV words (words that are not included in the lexicon) in the output of an LVCSR system. The OOV detector uses acoustic confidence measures derived from several systems: a word recognizer constrained by a lexicon, a phone recognizer constrained by a grammar, and a phone recognizer without constraints. In addition, it uses several linguistic features. Experimental results on a French broadcast news transcription task show that, for our approach, precision equals recall at 35%.
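As a rough illustration of how such features might be combined, the sketch below builds one feature vector per hypothesized word from acoustic confidence scores of the three recognizers plus two simple linguistic features and feeds it to a binary classifier. All feature names, values and the classifier choice are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: combine per-word acoustic confidence scores with
# linguistic features and classify each word hypothesis as OOV / in-vocabulary.
# Feature names and data are illustrative, not taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def word_features(word_conf, phone_gram_ll, phone_free_ll, word_len, unigram_logprob):
    """One feature vector per hypothesized word.
    word_conf:        posterior confidence from the lexicon-constrained word recognizer
    phone_gram_ll:    log-likelihood from the grammar-constrained phone recognizer
    phone_free_ll:    log-likelihood from the unconstrained phone recognizer
    word_len:         number of phones (a simple linguistic feature)
    unigram_logprob:  LM unigram log-probability of the hypothesized word
    """
    # Likelihood gap between unconstrained and constrained phone decodings:
    # a large gap suggests the lexicon forced a poor match (possible OOV).
    llr = phone_free_ll - phone_gram_ll
    return np.array([word_conf, llr, word_len, unigram_logprob])

# Toy training data (labels: 1 = OOV, 0 = in-vocabulary)
X = np.array([
    word_features(0.95, -40.0, -39.0, 5, -3.2),
    word_features(0.40, -80.0, -55.0, 7, -6.5),
    word_features(0.90, -35.0, -34.5, 4, -2.8),
    word_features(0.35, -90.0, -60.0, 8, -7.1),
])
y = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)
# Probability that a new word hypothesis is OOV.
print(clf.predict_proba([word_features(0.5, -70.0, -50.0, 6, -5.0)])[0, 1])
```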
Citations: 5
Active learning for rule-based and corpus-based Spoken Language Understanding models
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373377
Pierre Gotab, Frédéric Béchet, Géraldine Damnati
Active learning can be used for the maintenance of a deployed Spoken Dialog System (SDS) that evolves with time and for which a large collection of dialog traces can be gathered on a daily basis. At the Spoken Language Understanding (SLU) level this maintenance process is crucial, as a deployed SDS evolves quickly when services are added, modified or dropped. Knowledge-based approaches, based on manually written grammars or inference rules, are often preferred because system designers can directly modify the SLU models to take such a service modification into account, even if no or very little related data has been collected. However, as new examples are added to the annotated corpus, corpus-based methods can then be applied, replacing or complementing the initial knowledge-based models. This paper describes an active learning scheme, based on an SLU criterion, which is used for automatically updating the SLU models of a deployed SDS. Two kinds of SLU models are compared: rule-based models, used in the deployed system and consisting of several thousand hand-crafted rules; and corpus-based models, based on the automatic learning of classifiers on an annotated corpus.
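The sketch below illustrates the general shape of such a selection loop, assuming (purely for illustration) that the SLU criterion is the posterior confidence of a corpus-based classifier; the utterances, intents and selection rule are invented and do not reproduce the paper's criterion.

```python
# Illustrative active-learning loop: pick the utterances on which the current
# SLU classifier is least confident, have them annotated, and retrain.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled = [("check my balance", "balance"), ("talk to an agent", "agent")]
unlabeled = ["what is my account balance", "operator please", "cancel my order"]

for _ in range(2):  # a couple of selection rounds
    texts, intents = zip(*labeled)
    vec = CountVectorizer().fit(texts)
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), intents)

    # Assumed SLU-based selection criterion: lowest posterior confidence.
    confidences = clf.predict_proba(vec.transform(unlabeled)).max(axis=1)
    pick = int(np.argmin(confidences))
    utterance = unlabeled.pop(pick)

    # In a deployed system this label would come from a human annotator.
    labeled.append((utterance, "agent" if "operator" in utterance else "other"))
```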
Citations: 11
Noise robust model adaptation using linear spline interpolation
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373430
K. Kalgaonkar, M. Seltzer, A. Acero
This paper presents a novel data-driven technique for performing acoustic model adaptation to noisy environments. In the presence of additive noise, the relationship between log mel spectra of speech, noise and noisy speech is nonlinear. Traditional methods linearize this relationship using the mode of the nonlinearity or use some other approximation. The approach presented in this paper models this nonlinear relationship using linear spline regression. In this method, the set of spline parameters that minimizes the error between the predicted and actual noisy speech features is learned from training data, and used at runtime to adapt clean acoustic model parameters to the current noise conditions. Experiments were performed to evaluate the performance of the system on the Aurora 2 task. Results show that the proposed adaptation algorithm (word accuracy 89.22%) outperforms VTS model adaptation (word accuracy 88.38%).
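For intuition, the sketch below shows the per-bin relation between clean, noise and noisy log mel values under the additive-noise assumption, together with a piecewise-linear (spline) approximation of it; the knot placement is assumed uniform here, whereas the paper learns the spline parameters from training data.

```python
# Sketch of the nonlinearity between clean log-mel x, noise n and noisy y for
# additive noise, and a piecewise-linear approximation of it. Knot positions
# and count are illustrative assumptions.
import numpy as np

def noisy_logmel(x, n):
    # Per mel bin (additive-noise assumption): y = x + log(1 + exp(n - x))
    return x + np.log1p(np.exp(n - x))

# Approximate g(d) = log(1 + exp(-d)) with a linear spline in d = x - n.
knots = np.linspace(-10.0, 10.0, 9)           # assumed uniform knot positions
values = np.log1p(np.exp(-knots))             # spline values at the knots

def noisy_logmel_spline(x, n):
    d = x - n
    return x + np.interp(d, knots, values)    # piecewise-linear interpolation

d = np.linspace(-8.0, 8.0, 5)
# Maximum approximation error of the spline over a few test points.
print(np.max(np.abs(noisy_logmel(d, 0.0) - noisy_logmel_spline(d, 0.0))))
```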
Citations: 8
Automatic detection of vowel pronunciation errors using multiple information sources
Pub Date : 2009-12-01 DOI: 10.1109/asru.2009.5373335
Joost van Doremalen, C. Cucchiarini, H. Strik
Frequent pronunciation errors made by L2 learners of Dutch often concern vowel substitutions. To detect such pronunciation errors, ASR-based confidence measures (CMs) are generally used. In the current paper we compare and combine confidence measures with MFCCs and phonetic features. The best results are obtained using MFCCs, followed by CMs and finally phonetic features, and substantial improvements can be obtained by combining different features.
Citations: 26
The Asian network-based speech-to-speech translation system
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373353
S. Sakti, Noriyuki Kimura, Michael Paul, Chiori Hori, E. Sumita, Satoshi Nakamura, Jun Park, C. Wutiwiwatchai, Bo Xu, Hammam Riza, K. Arora, C. Luong, Haizhou Li
This paper outlines the first Asian network-based speech-to-speech translation system developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. The system was designed to translate common spoken utterances of travel conversations from a certain source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. Each A-STAR member contributes one or more of the following spoken language technologies, all offered through Web servers: automatic speech recognition, machine translation, and text-to-speech. Currently, the system covers 9 languages, namely 8 Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, Chinese) plus English. The system's domain covers about 20,000 travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we discuss the difficulties involved in connecting various different spoken language translation systems through Web servers. We also present speech-translation results from the first A-STAR demo experiments carried out in July 2009.
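A heavily simplified sketch of how such a chain of Web services could be invoked for one translation direction is shown below; the endpoint URLs and JSON fields are hypothetical and are not the A-STAR interfaces.

```python
# Hypothetical sketch of chaining ASR, MT and TTS web services for one
# translation direction. Endpoints and payload fields are invented.
import json
import urllib.request

def call_service(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def speech_to_speech(audio_b64, src_lang, tgt_lang):
    # 1. Recognize the source-language utterance.
    text = call_service("http://asr.example/recognize",
                        {"lang": src_lang, "audio": audio_b64})["text"]
    # 2. Translate the recognized text into the target language.
    translation = call_service("http://mt.example/translate",
                               {"src": src_lang, "tgt": tgt_lang, "text": text})["text"]
    # 3. Synthesize the translation in the target language.
    audio = call_service("http://tts.example/synthesize",
                         {"lang": tgt_lang, "text": translation})["audio"]
    return text, translation, audio
```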
Citations: 15
Acoustic emotion recognition: A benchmark comparison of performances
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372886
Björn Schuller, Bogdan Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth
In the light of the first challenge on emotion recognition from speech, we provide the largest-to-date benchmark comparison under equal conditions on nine standard corpora in the field, using the two predominant paradigms: modeling on the frame level by means of hidden Markov models, and supra-segmental modeling by systematic feature brute-forcing. The investigated corpora are the ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SmartKom, SUSAS, and VAM databases. To provide better comparability among sets, we additionally cluster each database's emotions into binary valence and arousal discrimination tasks. In the results, large differences are found among corpora, stemming mostly from naturalistic emotions and spontaneous speech vs. more prototypical events. Further, supra-segmental modeling proves significantly beneficial on average when several classes are addressed at a time.
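The sketch below illustrates the supra-segmental idea: statistical functionals are applied to frame-level descriptors so that each utterance becomes one fixed-length vector. The descriptors and functionals shown are a small assumed subset of what systematic brute-forcing would generate.

```python
# Minimal sketch of supra-segmental modelling: map frame-level low-level
# descriptors to one fixed-length utterance vector via statistical functionals.
import numpy as np

def functionals(track):
    """Summarize one frame-level contour (e.g. pitch or energy) with statistics."""
    t = np.arange(len(track))
    slope = np.polyfit(t, track, 1)[0]           # linear regression slope
    return [track.mean(), track.std(), track.min(), track.max(),
            track.max() - track.min(), slope]

def utterance_vector(frame_features):
    """frame_features: (n_frames, n_descriptors) array of low-level descriptors."""
    return np.concatenate([functionals(frame_features[:, j])
                           for j in range(frame_features.shape[1])])

frames = np.random.randn(120, 3)                 # e.g. energy, pitch, first MFCC
print(utterance_vector(frames).shape)            # 3 descriptors x 6 functionals = (18,)
```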
Citations: 268
Speaker de-identification via voice transformation
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373356
Qin Jin, Arthur R. Toth, Tanja Schultz, A. Black
It is a common feature of modern automated voice-driven applications and services to record and transmit a user's spoken request. At the same time, several domains and applications may require keeping the content of the user's request confidential while also preserving the speaker's privacy. This requires a technology that allows the speaker's voice to be de-identified, in the sense that the voice sounds natural and intelligible but does not reveal the identity of the speaker. In this paper we investigate different voice transformation strategies on a large population of speakers to disguise the speakers' identities while preserving the intelligibility of the voices. We apply two automatic speaker identification approaches, a GMM-based and a phonetic approach, to verify the success of de-identification with voice transformation. The evaluation based on the automatic speaker identification systems verifies that the proposed voice transformation technique enables transmission of the content of the users' spoken requests while successfully concealing their identities. The results also indicate that different speakers still sound distinct after the transformation. Furthermore, we carried out a human listening test that proved the transformed speech to be both intelligible and securely de-identified, as it hid the identity of the speakers even from listeners who knew the speakers very well.
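As a toy illustration of the GMM-based verification step, the sketch below trains per-speaker Gaussian mixture models and scores "transformed" speech under them; the features are random stand-ins rather than MFCCs from real recordings, and the transformation is simulated by a simple mean shift.

```python
# Illustrative check of de-identification with a GMM speaker-ID system.
# A successful transformation should make the original speaker's model score
# the transformed voice no better than other speakers' models.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
speaker_a = rng.normal(0.0, 1.0, size=(500, 13))   # enrollment features, speaker A
speaker_b = rng.normal(1.5, 1.0, size=(500, 13))   # enrollment features, speaker B
gmm_a = GaussianMixture(4, covariance_type="diag").fit(speaker_a)
gmm_b = GaussianMixture(4, covariance_type="diag").fit(speaker_b)

# "Transformed" test speech originally from speaker A (simulated mean shift).
transformed = rng.normal(1.0, 1.0, size=(200, 13))
print("avg log-likelihood under A's model:", gmm_a.score(transformed))
print("avg log-likelihood under B's model:", gmm_b.score(transformed))
```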
Citations: 55
Sub-band modulation spectrum compensation for robust speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373506
Wen-hsiang Tu, Sheng-Yuan Huang, J. Hung
This paper proposes a novel scheme for performing feature-statistics normalization for robust speech recognition. In the proposed approach, the processed temporal-domain feature sequence is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into non-uniform sub-band segments, and each sub-band segment is then individually processed by well-known normalization methods such as mean normalization (MN), mean and variance normalization (MVN) and histogram equalization (HEQ). Finally, we reconstruct the feature stream from the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components that correspond to the more important modulation spectral bands of the feature sequence can be processed separately. For the Aurora-2 clean-condition training task, the proposed sub-band spectral MVN and HEQ provide relative error rate reductions of 18.66% and 23.58% over the conventional temporal MVN and HEQ, respectively.
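A minimal sketch of the sub-band MVN variant of this scheme, applied to a single feature trajectory, is given below; the band boundaries and reference statistics are assumptions for illustration, whereas the paper uses non-uniform bands and statistics estimated from training data.

```python
# Minimal sketch of sub-band modulation-spectrum MVN for one feature trajectory:
# DFT the trajectory, normalize the magnitude spectrum band by band, keep the
# original phase, and reconstruct with the inverse DFT.
import numpy as np

def subband_mvn(traj, band_edges, ref_mean, ref_std):
    """traj: temporal sequence of one cepstral coefficient (1-D array)."""
    spec = np.fft.rfft(traj)
    mag, phase = np.abs(spec), np.angle(spec)
    for k, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        seg = mag[lo:hi]
        # Normalize this sub-band's magnitude to the reference mean and variance.
        norm = (seg - seg.mean()) / (seg.std() + 1e-8) * ref_std[k] + ref_mean[k]
        mag[lo:hi] = np.maximum(norm, 0.0)        # keep magnitudes non-negative
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(traj))

traj = np.random.randn(200)                       # e.g. c1 over 200 frames
edges = [0, 5, 20, 101]                           # assumed non-uniform band boundaries
out = subband_mvn(traj, edges, ref_mean=[1.0, 0.5, 0.2], ref_std=[0.5, 0.3, 0.1])
print(out.shape)                                  # same length as the input trajectory
```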
Citations: 11
Diagonal priors for full covariance speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373344
P. Bell, Simon King
We investigate the use of full covariance Gaussians for large-vocabulary speech recognition. The large number of parameters gives high modelling power, but when training data is limited, the standard sample covariance matrix is often poorly conditioned, and has high variance. We explain how these problems may be solved by the use of a diagonal covariance smoothing prior, and relate this to the shrinkage estimator, for which the optimal shrinkage parameter may itself be estimated from the training data. We also compare the use of generatively and discriminatively trained priors. Results are presented on a large vocabulary conversational telephone speech recognition task.
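The diagonal smoothing itself can be written as a convex combination of the sample covariance and its diagonal. The sketch below uses a fixed shrinkage weight for illustration, whereas the paper estimates the optimal weight from the training data.

```python
# Sketch of smoothing a full sample covariance towards its diagonal:
#   Sigma = (1 - lam) * S + lam * diag(S)
# The shrinkage weight lam is fixed here purely for illustration.
import numpy as np

def shrink_to_diagonal(samples, lam):
    S = np.cov(samples, rowvar=False)              # full sample covariance
    D = np.diag(np.diag(S))                        # its diagonal part (the prior)
    return (1.0 - lam) * S + lam * D

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 39))                      # few samples, 39-dim features

# With fewer samples than dimensions the sample covariance is singular /
# poorly conditioned; shrinking towards the diagonal repairs the conditioning.
print(np.linalg.cond(np.cov(X, rowvar=False)))
print(np.linalg.cond(shrink_to_diagonal(X, lam=0.3)))
```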
Citations: 8
Optimal quantization and bit allocation for compressing large discriminative feature space transforms
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373407
E. Marcheret, V. Goel, P. Olsen
Discriminative training of the feature space using the minimum phone error (MPE) objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost in the memory required to store the transform. In a previous paper we reduced this memory requirement by 94% by quantizing the transform parameters. We used dimension-dependent quantization tables and learned the quantization values with a fixed assignment of transform parameters to quantization values. In this paper we refine and extend these techniques to attain a further 35% reduction in memory with no degradation in sentence error rate. We discuss a principled method to assign the transform parameters to quantization values. We also show how the memory can be gradually reduced using a Viterbi algorithm to optimally assign a variable number of bits to the dimension-dependent quantization tables. The techniques described could also be applied to the quantization of general linear transforms, a problem that should be of wider interest.
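As a simplified illustration of dimension-dependent quantization tables, the sketch below learns one small codebook per transform row with k-means; the bit allocation is kept uniform here, whereas the paper additionally optimizes a variable allocation with a Viterbi search.

```python
# Illustrative sketch: quantize each row (dimension) of a transform with its
# own small codebook, storing codebooks plus integer codes instead of floats.
import numpy as np
from sklearn.cluster import KMeans

def quantize_rows(transform, bits_per_dim):
    tables, codes = [], []
    for row, bits in zip(transform, bits_per_dim):
        k = 2 ** bits                               # codebook size for this dimension
        km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(row.reshape(-1, 1))
        tables.append(km.cluster_centers_.ravel())  # per-dimension quantization table
        codes.append(km.labels_.astype(np.uint8))   # index of each parameter
    return tables, codes

def dequantize_rows(tables, codes):
    return np.vstack([table[code] for table, code in zip(tables, codes)])

T = np.random.randn(10, 200)                        # toy "feature-space transform"
tables, codes = quantize_rows(T, bits_per_dim=[3] * 10)
print(np.mean((T - dequantize_rows(tables, codes)) ** 2))  # quantization distortion
```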
Citations: 5