
Latest publications from the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding

SNR features for automatic speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372895
Philip N. Garner
When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal to Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectral based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further show anecdotal evidence that SNR based features lend themselves well to noise estimates based on low-energy envelope tracking.
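The core idea, computing per-bin SNR directly from a noise estimate rather than recovering it through cepstral normalisation, can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's exact estimator; the leading-frame noise estimate is an assumption.

```python
import numpy as np

def snr_features(frame_powers, noise_power, floor=1e-10):
    """Per-bin a-posteriori SNR features from power spectra.

    frame_powers: (n_frames, n_bins) power spectra of the utterance
    noise_power:  (n_bins,) noise power estimate (e.g. from leading frames)
    Returns log-SNR features; a DCT over mel-binned values would give cepstra.
    """
    snr = frame_powers / np.maximum(noise_power, floor)
    return np.log(np.maximum(snr, floor))

def leading_frame_noise(frame_powers, n_lead=10):
    """Crude noise estimate: mean power of the first (assumed non-speech) frames."""
    return frame_powers[:n_lead].mean(axis=0)
```

With this formulation the features are already normalised with respect to the noise level, which is the practical advantage the abstract refers to.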
Citations: 6
Any questions? Automatic question detection in meetings
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373293
K. Boakye, Benoit Favre, Dilek Z. Hakkani-Tür
In this paper, we describe our efforts toward the automatic detection of English questions in meetings. We analyze the utility of various features for this task, originating from three distinct classes: lexico-syntactic, turn-related, and pitch-related. Of particular interest is the use of parse tree information in classification, an approach as yet unexplored. Results from experiments on the ICSI MRDA corpus demonstrate that lexico-syntactic features are most useful for this task, with turn- and pitch-related features providing complementary information in combination. In addition, experiments using reference parse trees on the broadcast conversation portion of the OntoNotes release 2.9 data set illustrate the potential of parse trees to outperform word lexical features.
Citations: 29
Query-by-example spoken term detection using phonetic posteriorgram templates
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372889
Timothy J. Hazen, Wade Shen, Christopher M. White
This paper examines a query-by-example approach to spoken term detection in audio files. The approach is designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable. Instead of using word or phone strings as search terms, the user presents the system with audio snippets of desired search terms to act as the queries. Query and test materials are represented using phonetic posteriorgrams obtained from a phonetic recognition system. Query matches in the test data are located using a modified dynamic time warping search between query templates and test utterances. Experiments using this approach are presented using data from the Fisher corpus.
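The matching step can be sketched with plain dynamic time warping over posteriorgram frames. The negative-log inner product as frame distance is an assumption about the paper's choice; the segmental variant additionally restricts the warp to a band and restarts the alignment at multiple offsets in the test utterance.

```python
import numpy as np

def posterior_dist(p, q, eps=1e-10):
    # Distance between two posterior vectors: -log of their inner product.
    return -np.log(max(float(np.dot(p, q)), eps))

def dtw_distance(query, test):
    """Plain DTW alignment cost between two posteriorgram sequences."""
    Q, T = len(query), len(test)
    D = np.full((Q + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, T + 1):
            c = posterior_dist(query[i - 1], test[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, T] / (Q + T)  # length-normalised alignment cost
```

Ranking test utterances by this cost against the query template yields the detection scores the abstract describes.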
Citations: 300
Improving online incremental speaker adaptation with eigen feature space MLLR
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373227
Xiaodong Cui, Jian Xue, Bowen Zhou
This paper investigates an eigen feature space maximum likelihood linear regression (fMLLR) scheme to improve the performance of online speaker adaptation in automatic speech recognition systems. In this stochastic-approximation-like framework, the traditional incremental fMLLR estimation is considered as a slowly changing mean of the eigen fMLLR. It helps the adaptation when only a limited amount of data is available at the beginning of the conversation. The scheme is shown to be able to balance the transformation estimation given the data and yields reasonable improvements for online systems.
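The "slowly changing mean" idea can be illustrated with a count-based interpolation between a prior transform (the eigen-space mean) and the current maximum-likelihood estimate. This is a simplified sketch of the principle, not the paper's estimator; `tau` is a hypothetical smoothing constant.

```python
import numpy as np

def blend_transform(W_prior, W_ml, n_frames, tau=500.0):
    """MAP-style interpolation of fMLLR transforms: with little adaptation
    data the prior (eigen-space mean) dominates; as frames accumulate,
    the maximum-likelihood estimate takes over."""
    alpha = n_frames / (n_frames + tau)
    return (1.0 - alpha) * W_prior + alpha * W_ml
```

This captures why the scheme helps early in a conversation: the transform never strays far from a well-trained prior until enough data has been observed.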
Citations: 10
Vietnamese large vocabulary continuous speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373424
Ngoc Thang Vu, Tanja Schultz
We report on our recent efforts toward a large vocabulary Vietnamese speech recognition system. In particular, we describe the Vietnamese text and speech database recently collected as part of our GlobalPhone corpus. The data was complemented by a large collection of text data crawled from various Vietnamese websites. To bootstrap the Vietnamese speech recognition system we used our Rapid Language Adaptation scheme applying a multilingual phone inventory. After initialization we investigated the peculiarities of the Vietnamese language and achieved significant improvements by implementing different tone modeling schemes, extended by pitch extraction, handling multiwords to address the monosyllable structure of Vietnamese, and featuring language modeling based on 5-grams. Furthermore, we addressed the issue of dialectal variations between South and North Vietnam by creating dialect dependent pronunciations and including dialect in the context decision tree of the recognizer. Our currently best recognition system achieves a word error rate of 11.7% on read newspaper speech.
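As a toy illustration of the 5-gram language modelling mentioned above, here are maximum-likelihood counts only; a real system would apply smoothing and use an LM toolkit, which the abstract does not specify.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-token windows in a training text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_5gram_prob(tokens, history, word):
    """P(word | 4-token history) by maximum likelihood over the training text."""
    c5 = ngram_counts(tokens, 5)
    c4 = ngram_counts(tokens, 4)
    h = tuple(history)
    return c5[h + (word,)] / c4[h] if c4[h] else 0.0
```

For Vietnamese, the "multiwords" mentioned in the abstract would enter here as single tokens, lengthening the effective context of each 5-gram beyond five syllables.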
Citations: 71
Comparing automatic rich transcription for Portuguese, Spanish and English Broadcast News
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373371
Fernando Batista, I. Trancoso, N. Mamede
This paper describes and evaluates a language independent approach for automatically enriching the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach, which achieves results within state-of-the-art. The language independence of the approach is attested with experiments conducted on Portuguese, Spanish and English Broadcast News corpora. This paper provides the first comparative study between the three languages, concerning these tasks.
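A binary maximum-entropy classifier is equivalent to logistic regression. The sketch below illustrates the punctuation decision; the two features (a pause flag and a lexical cue) are hypothetical stand-ins for the paper's actual feature set.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, iters=2000):
    """Fit weights by gradient ascent on the logistic log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

def predict_punct(w, x):
    """P(insert a full stop after this word | features x)."""
    return 1.0 / (1.0 + np.exp(-x @ w))
```

The same machinery, with a different label set, handles the capitalization task, which is what makes the two problems naturally comparable across the three languages.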
Citations: 8
A study on Hidden Structural Model and its application to labeling sequences
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373239
Y. Qiao, Masayuki Suzuki, N. Minematsu
This paper proposes Hidden Structure Model (HSM) for statistical modeling of sequence data. The HSM generalizes our previous proposal on structural representation by introducing hidden states and probabilistic models. Compared with the previous structural representation, HSM not only can solve the problem of misalignment of events, but also can conduct structure-based decoding, which allows us to apply HSM to general speech recognition tasks. Different from HMM, HSM accounts for the probability of both locally absolute and globally contrastive features. This paper focuses on the fundamental formulation and theories of HSM. We also develop methods for the problems of state inference, probability calculation and parameter estimation of HSM. Especially, we show that the state inference of HSM can be reduced to a quadratic programming problem. We carry out two experiments to examine the performance of HSM on labeling sequences. The first experiment tests HSM by using artificially transformed sequences, and the second experiment is based on a Japanese corpus of connected vowel utterances. The experimental results demonstrate the effectiveness of HSM.
Citations: 10
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372931
Yaodong Zhang, James R. Glass
In this paper, we present an unsupervised learning framework to address the problem of detecting spoken keywords. Without any transcription information, a Gaussian Mixture Model is trained to label speech frames with a Gaussian posteriorgram. Given one or more spoken examples of a keyword, we use segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances. The keyword detection result is then obtained by ranking the distortion scores of all the test utterances. We examine the TIMIT corpus as a development set to tune the parameters in our system, and the MIT Lecture corpus for more substantial evaluation. The results demonstrate the viability and effectiveness of our unsupervised learning framework on the keyword spotting task.
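Computing a Gaussian posteriorgram from a trained diagonal-covariance GMM reduces to per-frame component posteriors. A minimal numpy sketch (model parameters are assumed given; the paper trains them without transcriptions):

```python
import numpy as np

def gaussian_posteriorgram(frames, means, variances, weights):
    """Posterior over GMM components for each frame.

    frames: (T, D); means, variances: (K, D); weights: (K,)
    Returns a (T, K) array whose rows each sum to 1: the posteriorgram.
    """
    post = np.empty((len(frames), len(weights)))
    for t, x in enumerate(frames):
        ll = (np.log(weights)
              - 0.5 * np.sum(np.log(2 * np.pi * variances)
                             + (x - means) ** 2 / variances, axis=1))
        ll -= ll.max()          # stabilise before exponentiation
        p = np.exp(ll)
        post[t] = p / p.sum()
    return post
```

These posterior sequences are what the segmental DTW step then compares between keyword examples and test utterances.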
Citations: 363
An improved perceptual speech enhancement technique employing a psychoacoustically motivated weighting factor
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5372883
Md. Jahangir Alam, S. Selouani, D. O'Shaughnessy
Suppression of speech components after perceptual speech enhancement (SE) lowers the noise masking threshold (NMT) level of the enhanced signal. This may re-introduce noise components that are initially masked but not processed by the denoising filter, thereby, favoring the emergence of musical noise. This paper presents a modified perceptual speech enhancement algorithm based on a perceptually motivated weighting factor to effectively suppress the background noise without introducing much distortion in the enhanced signal using the perceptual speech enhancement methods. The performance of the proposed enhancement algorithm is evaluated by the Segmental SNR and Perceptual Evaluation of Speech Quality (PESQ) measures under various noisy environments and yields better results compared to the perceptual speech enhancement methods.
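Of the two evaluation measures, segmental SNR is simple to state: frame-level SNRs, clamped to a typical range, averaged over the utterance. A sketch, with the usual clamp values taken as assumptions rather than from the paper:

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, smin=-10.0, smax=35.0):
    """Mean of per-frame SNRs between a clean reference and an enhanced
    signal, with each frame SNR clamped to [smin, smax] dB."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise = np.sum((s - e) ** 2)
        if noise <= 0:                    # identical frame: clamp to the cap
            snrs.append(smax)
            continue
        snr = 10 * np.log10(np.sum(s ** 2) / noise)
        snrs.append(float(np.clip(snr, smin, smax)))
    return float(np.mean(snrs))
```

PESQ, the other measure cited, is a standardized perceptual model (ITU-T P.862) and is not reproducible in a few lines.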
Citations: 0
Improving joint uncertainty decoding performance by predictive methods for noise robust speech recognition
Pub Date : 2009-12-01 DOI: 10.1109/ASRU.2009.5373317
H. Xu, M. Gales, K. K. Chin
Model-based noise compensation techniques, such as Vector Taylor Series (VTS) compensation, have been applied to a range of noise robustness tasks. However one of the issues with these forms of approach is that for large speech recognition systems they are computationally expensive. To address this problem schemes such as Joint Uncertainty Decoding (JUD) have been proposed. Though computationally more efficient, the performance of the system is typically degraded. This paper proposes an alternative scheme, related to JUD, but making fewer approximations, VTS-JUD. Unfortunately this approach also removes some of the computational advantages of JUD. To address this, rather than using VTS-JUD directly, it is used instead to obtain statistics to estimate a predictive linear transform, PCMLLR. This is both computationally efficient and limits some of the issues associated with the diagonal covariance matrices typically used with schemes such as VTS. PCMLLR can also be simply used within an adaptive training framework (PAT). The performance of the VTS-JUD, PCMLLR and PAT system were compared to a number of standard approaches on an in-car speech recognition task. The proposed scheme is an attractive alternative to existing approaches.
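For reference, the zeroth-order VTS mean compensation in the log-spectral domain (channel term omitted) is the textbook relation underlying the schemes compared here; it is backdrop to the paper rather than its contribution.

```python
import numpy as np

def vts_noisy_mean(mu_x, mu_n):
    """mu_y = mu_x + log(1 + exp(mu_n - mu_x)): mean of log-spectral
    speech under additive noise, to zeroth order in the VTS expansion."""
    return mu_x + np.log1p(np.exp(mu_n - mu_x))
```

The expense the abstract mentions comes from applying this (and its Jacobians) per Gaussian in a large acoustic model, which is what JUD, VTS-JUD and the predictive PCMLLR transform aim to cheapen.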
Citations: 9