
Latest publications: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Matched-condition robust Dynamic Noise Adaptation
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163919
Steven J. Rennie, Pierre L. Dognin, P. Fousek
In this paper we describe how the model-based noise robustness algorithm for previously unseen noise conditions, Dynamic Noise Adaptation (DNA), can be made robust to matched data, without the need to do any system re-training. The approach is to do online model selection and averaging between two DNA models of noise: one that is tracking the evolving state of the background noise, and one clamped to the null mismatch hypothesis. The approach, which we call DNA with (matched) condition detection (DNA-CD), improves the performance of a commercial-grade speech recognizer that utilizes feature-space Maximum Mutual Information (fMMI), boosted MMI (bMMI), and feature-space Maximum Likelihood Linear Regression (fMLLR) compensation by 15% relative at signal-to-noise ratios (SNRs) below 10 dB, and over 8% relative overall.
Citations: 5
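The online selection-and-averaging step described above can be illustrated with a minimal sketch. Here the two noise hypotheses are reduced to 1-D Gaussians (one "tracking", one "clamped") and each frame's noise estimate is their posterior-weighted average; the function names and the Gaussian simplification are illustrative assumptions, not the paper's implementation.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of scalar x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def average_models(x, models, log_prior):
    """Posterior-weighted average of two noise models' mean estimates.

    models: list of (mean, var) pairs; log_prior: matching log prior weights.
    Returns (averaged noise estimate, posterior weights).
    """
    log_post = [lp + gaussian_loglik(x, m, v)
                for lp, (m, v) in zip(log_prior, models)]
    mx = max(log_post)                      # stabilize the softmax
    w = [math.exp(l - mx) for l in log_post]
    z = sum(w)
    w = [wi / z for wi in w]
    avg = sum(wi * m for wi, (m, _) in zip(w, models))
    return avg, w
```

When the observation sits close to the tracking model, its posterior weight dominates and the averaged estimate follows it; near the null-mismatch hypothesis the weights swing the other way.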
Efficient spoken term discovery using randomized algorithms
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163965
A. Jansen, Benjamin Van Durme
Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n²) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate that these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero resource setting.
Citations: 165
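The O(n)-space idea (bucket frames by randomized hash signatures so that only likely-similar pairs are ever compared, instead of filling the full n² similarity matrix) can be sketched as follows. Random-hyperplane signatures are one standard choice for this; the function names are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def signatures(vectors, n_bits, dim, seed=0):
    """Hash each feature vector to an n_bits signature via random
    hyperplanes (sign of the dot product), so that vectors pointing in
    similar directions tend to collide."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    sigs = []
    for v in vectors:
        bits = 0
        for p in planes:
            bits = (bits << 1) | (sum(a * b for a, b in zip(p, v)) > 0)
        sigs.append(bits)
    return sigs

def candidate_pairs(sigs):
    """Bucket indices by signature; only intra-bucket pairs are kept,
    yielding a sparse approximate similarity structure."""
    buckets = defaultdict(list)
    for i, s in enumerate(sigs):
        buckets[s].append(i)
    pairs = set()
    for idxs in buckets.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                pairs.add((idxs[a], idxs[b]))
    return pairs
```

Vectors in the same direction always share a signature, while near-opposite vectors never do, so the candidate set stays small and the expensive DTW comparison need only run on surviving pairs.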
Improved spoken term detection using support vector machines with acoustic and context features from pseudo-relevance feedback
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163962
Tsung-wei Tu, Hung-yi Lee, Lin-Shan Lee
This paper reports a new approach to improving spoken term detection that uses support vector machines (SVMs) with acoustic and linguistic features. As the SVM is a good technique for discriminating between different features in vector space, we recently proposed using pseudo-relevance feedback to automatically generate training data for SVM training, and using the SVM to re-rank the first-pass results considering the context consistency in the lattices. In this paper, we further extend this concept by considering acoustic features at the word, phone, and HMM state levels and linguistic features of different orders. Extensive experiments under various recognition environments demonstrate significant improvements in all cases. In particular, the acoustic features at the HMM state level offered the most significant improvements, and the improvements achieved by acoustic and linguistic features are shown to be additive.
Citations: 18
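The pseudo-relevance feedback loop (label the most confident first-pass hits positive, the least confident negative, train a discriminative scorer on them, then re-rank everything) can be sketched with a perceptron standing in for the SVM. All names, the perceptron substitution, and the tiny feature vectors are illustrative assumptions, not the paper's system.

```python
def prf_rerank(candidates, scores, features, top_n=2, bottom_n=2,
               epochs=20, lr=0.1):
    """Pseudo-relevance feedback re-ranking: treat the top_n first-pass
    hits as positive and the bottom_n as negative, train a perceptron on
    their feature vectors, then re-rank all candidates by learned score.
    Returns candidate indices in new rank order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos, neg = order[:top_n], order[-bottom_n:]
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    train = [(i, 1) for i in pos] + [(i, -1) for i in neg]
    for _ in range(epochs):
        for i, lab in train:
            act = sum(wj * xj for wj, xj in zip(w, features[i])) + b
            if lab * act <= 0:  # misclassified: perceptron update
                w = [wj + lr * lab * xj for wj, xj in zip(w, features[i])]
                b += lr * lab
    new = [sum(wj * xj for wj, xj in zip(w, features[i])) + b
           for i in range(len(candidates))]
    return sorted(range(len(candidates)), key=lambda i: -new[i])
```

A candidate whose features resemble the confident hits rises in the ranking even if its first-pass score was middling, which is the intended effect of the feedback step.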
Applying feature bagging for more accurate and robust automated speaking assessment
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163977
L. Chen
The scoring model used in automated speaking assessment systems is critical for achieving accurate and robust scoring of speaking skills automatically. In the automated speaking assessment research field, using a single classifier model is still the dominant approach. However, ensemble learning, which relies on a committee of classifiers to predict jointly (to overcome each individual classifier's weaknesses), has been actively advocated by machine learning researchers and is widely used in many machine learning tasks. In this paper, we investigated applying a special ensemble learning method, feature bagging, to the task of automatically scoring non-native spontaneous speech. Our experiments show that this method is superior to using a single classifier in terms of both scoring accuracy and robustness to possible feature variations.
Citations: 1
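Feature bagging itself (each committee member trains on a random subset of the features, and the committee predicts by majority vote) can be sketched with nearest-centroid members. The class name, member count, and nearest-centroid choice are illustrative assumptions, not the paper's classifiers.

```python
import random
from collections import Counter

class FeatureBagger:
    """Committee of nearest-centroid classifiers, each trained on a
    random subset of the features; prediction is by majority vote."""

    def __init__(self, n_members=5, subset_size=2, seed=0):
        self.n_members = n_members
        self.subset_size = subset_size
        self.rng = random.Random(seed)
        self.members = []  # list of (feature_idxs, {label: centroid})

    def fit(self, X, y):
        dim = len(X[0])
        for _ in range(self.n_members):
            idxs = self.rng.sample(range(dim), self.subset_size)
            cents = {}
            for lab in set(y):
                rows = [[x[i] for i in idxs]
                        for x, t in zip(X, y) if t == lab]
                cents[lab] = [sum(col) / len(rows) for col in zip(*rows)]
            self.members.append((idxs, cents))

    def predict(self, x):
        votes = []
        for idxs, cents in self.members:
            sub = [x[i] for i in idxs]
            lab = min(cents, key=lambda l: sum(
                (a - b) ** 2 for a, b in zip(sub, cents[l])))
            votes.append(lab)
        return Counter(votes).most_common(1)[0][0]
```

Because each member sees different features, a single noisy or unreliable feature can mislead at most a few votes, which is the robustness argument the abstract makes.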
A factored conditional random field model for articulatory feature forced transcription
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163909
Rohit Prabhavalkar, E. Fosler-Lussier, Karen Livescu
We investigate joint models of articulatory features and apply these models to the problem of automatically generating articulatory transcriptions of spoken utterances given their word transcriptions. The task is motivated by the need for larger amounts of labeled articulatory data for both speech recognition and linguistics research, which is costly and difficult to obtain through manual transcription or physical measurement. Unlike phonetic transcription, in our task it is important to account for the fact that the articulatory features can desynchronize. We consider factored models of the articulatory state space with an explicit model of articulator asynchrony. We compare two types of graphical models: a dynamic Bayesian network (DBN), based on previously proposed models; and a conditional random field (CRF), which we develop here. We demonstrate how task-specific constraints can be leveraged to allow for efficient exact inference in the CRF. On the transcription task, the CRF outperforms the DBN, with relative improvements of 2.2% to 10.0%.
Citations: 12
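One concrete ingredient of such a factored model is the asynchrony constraint: each articulatory stream keeps its own position index, but the streams may only drift a bounded number of positions apart, which keeps the joint state space (and hence exact inference) tractable. A minimal sketch, with the function name and bound as illustrative assumptions:

```python
from itertools import product

def factored_states(lengths, max_async=1):
    """Enumerate joint articulatory states (one position index per
    feature stream) whose streams stay within max_async positions of
    each other -- the explicit asynchrony constraint that prunes the
    full cross-product state space."""
    states = []
    for combo in product(*(range(n) for n in lengths)):
        if max(combo) - min(combo) <= max_async:
            states.append(combo)
    return states
```

With two streams of length 3, a bound of 0 forces lock-step synchrony (3 states), while a bound of 1 admits 7 of the 9 cross-product states.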
Exploiting distance based similarity in topic models for user intent detection
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163969
Asli Celikyilmaz, Dilek Z. Hakkani-Tür, Gökhan Tür, Ashley Fidler, D. Hillard
One of the main components of spoken language understanding is intent detection, which allows user goals to be identified. A challenging sub-task of intent detection is the identification of intent-bearing phrases from a limited amount of training data, while maintaining the ability to generalize well. We present a new probabilistic topic model for jointly identifying semantic intents and common phrases in spoken language utterances. Our model jointly learns a set of intent-dependent phrases and captures semantic intent clusters as distributions over these phrases based on a distance-dependent sampling method. This sampling method uses the proximity of words within utterances when assigning words to latent topics. We evaluate our method on labeled utterances and present several examples of discovered semantic units. We demonstrate that our model outperforms standard topic models based on the bag-of-words assumption.
Citations: 13
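The distance-dependent assignment step can be made concrete: a word's affinity for a topic decays with its distance from the words already assigned to that topic. The exponential-decay form below (in the style of distance-dependent CRP priors) and all names are illustrative assumptions, not the authors' exact model.

```python
import math

def topic_affinities(pos_i, assignments, positions, decay=1.0):
    """Distance-dependent affinity of the word at position pos_i to each
    topic: proportional to the summed exp(-decay * |pos_i - pos_j|) over
    words j currently assigned to that topic."""
    scores = {}
    for j, k in enumerate(assignments):
        d = abs(pos_i - positions[j])
        scores[k] = scores.get(k, 0.0) + math.exp(-decay * d)
    z = sum(scores.values())
    return {k: s / z for k, s in scores.items()}
```

A word near a run of topic-0 words is pulled toward topic 0 even if a distant topic-1 word exists, which is exactly the proximity effect the abstract describes.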
Estimating document frequencies in a speech corpus
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163966
D. Karakos, Mark Dredze, K. Church, A. Jansen, S. Khudanpur
Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
Citations: 13
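A soft document-frequency estimate in the spirit of the lattice-based baseline can be sketched as follows: each document contributes the probability that the word appears at least once, computed from the lattice posteriors of its candidate occurrences. The independence assumption across occurrences and the function name are illustrative, not the paper's exact estimator.

```python
def expected_df(doc_posteriors):
    """Soft document frequency of a word. doc_posteriors is one list per
    document holding the lattice posterior probability of each candidate
    occurrence of the word; the chance the word appears at least once in
    a document is 1 - prod(1 - p), assuming independent occurrences, and
    summing over documents gives a soft df."""
    df = 0.0
    for probs in doc_posteriors:
        p_absent = 1.0
        for p in probs:
            p_absent *= 1.0 - p
        df += 1.0 - p_absent
    return df
```

A document with one certain hit contributes 1, a document with two 50% hits contributes 0.75, and a document with no hits contributes 0, so the estimate degrades gracefully as recognition confidence drops.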
Minimum detection error training of subword detectors
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163983
Alfonso M. Canterla, M. H. Johnsen
This paper presents methods and results for optimizing subword detectors in continuous speech. Speech detectors are useful within areas like detection-based ASR, pronunciation training, phonetic analysis, word spotting, etc. We propose a new discriminative training criterion for subword unit detectors that is based on the Minimum Phone Error framework. The criterion can optimize the F-score or any other detection performance metric. The method is applied to the optimization of HMMs and MFCC filterbanks in phone detectors. The resulting filterbanks differ from each other and reflect acoustic properties of the corresponding detection classes. For the experiments in TIMIT, the best optimized detectors had a relative accuracy improvement of 31.3% over baseline and 18.2% over our previous MCE-based method.
Citations: 2
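The detection metric such a criterion can optimize is worth pinning down; below is a minimal F-beta score over hit counts (the function name is illustrative, and the training criterion itself would use a smoothed version of this quantity).

```python
def f_score(tp, fp, fn, beta=1.0):
    """F_beta detection metric: the weighted harmonic combination of
    precision and recall computed from true positives, false positives,
    and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this is the usual F1; beta > 1 weights recall more heavily, which matters when missed detections are costlier than false alarms.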
Cross-lingual portability of Chinese and English neural network features for French and German LVCSR
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163960
Christian Plahl, R. Schlüter, H. Ney
This paper investigates neural network (NN) based cross-lingual probabilistic features. Earlier work reports that intra-lingual features consistently outperform the corresponding cross-lingual features. We show that this may not generalize. Depending on the complexity of the NN features, cross-lingual features reduce the resources used for training (the NN has to be trained on one language only) without any loss in performance w.r.t. word error rate (WER). To further investigate this inconsistency concerning intra- vs. cross-lingual neural network features, we analyze the performance of these features w.r.t. the degree of kinship between the training and testing languages, and the amount of training data used. Whenever the same amount of data is used for NN training, a close relationship between the training and testing languages is required to achieve similar results. Increasing the amount of training data weakens this dependence, as does changing the topology of the NN to the bottleneck structure. Moreover, cross-lingual features trained on English or Chinese improve the best intra-lingual system for German by up to 2% relative in WER and up to 3% relative for French, and achieve the same improvement as discriminative training. In addition, we gain up to 8% relative in WER by combining intra- and cross-lingual systems.
Citations: 37
Model-based parametric features for emotion recognition from speech
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163987
Sankaranarayanan Ananthakrishnan, Aravind Namandi Vembu, R. Prasad
Automatic emotion recognition from speech is desirable in many applications relying on spoken language processing. Telephone-based customer service systems, psychological healthcare initiatives, and virtual training modules are examples of real-world applications that would significantly benefit from such capability. Traditional utterance-level emotion recognition relies on a global feature set obtained by computing various statistics from raw segmental and supra-segmental measurements, including fundamental frequency (F0), energy, and MFCCs. In this paper, we propose a novel, model-based parametric feature set that better discriminates between the competing emotion classes. Our approach relaxes modeling assumptions associated with using global statistics (e.g. mean, standard deviation, etc.) of traditional segment-level features for classification, and results in significant improvements over the state-of-the-art in 7-way emotion classification accuracy on the standard, freely-available Berlin Emotional Speech Corpus. These improvements are consistent even in a reduced feature space obtained by Fisher's Multiple Linear Discriminant Analysis, demonstrating the significantly higher discriminative power of the proposed feature set.
在许多依赖于口语处理的应用中,语音的自动情感识别是需要的。基于电话的客户服务系统、心理医疗保健计划和虚拟培训模块是可以从这种能力中显著受益的实际应用程序的示例。传统的话语级情感识别依赖于通过计算原始分段和超分段测量的各种统计数据获得的全局特征集,包括基频(F0)、能量和mfccc。在本文中,我们提出了一种新的,基于模型的参数特征集,可以更好地区分竞争情绪类别。我们的方法放松了与使用传统分段级特征的全局统计(例如平均值、标准差等)进行分类相关的建模假设,并在标准的、免费的柏林情感语音语料库上显著提高了最先进的7向情感分类精度。即使在Fisher多元线性判别分析得到的简化特征空间中,这些改进也是一致的,这表明所提出的特征集具有显着更高的判别能力。
Model-based parametric features for emotion recognition from speech
Sankaranarayanan Ananthakrishnan, Aravind Namandi Vembu, R. Prasad
Pub Date: 2011-12-01 DOI: 10.1109/ASRU.2011.6163987
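The traditional baseline the abstract describes collapses frame-level measurements (F0, energy, MFCCs) into one utterance-level vector of global statistics. A minimal sketch of that baseline, with hypothetical function and variable names and random data standing in for real acoustic frames:

```python
import numpy as np

def global_stats_features(frames):
    """Collapse a (num_frames, num_dims) matrix of frame-level
    measurements (e.g. F0, energy, MFCCs) into a single
    utterance-level vector of global statistics."""
    stats = [
        frames.mean(axis=0),  # per-dimension mean
        frames.std(axis=0),   # per-dimension standard deviation
        frames.min(axis=0),   # per-dimension minimum
        frames.max(axis=0),   # per-dimension maximum
    ]
    return np.concatenate(stats)

# Toy utterance: 200 frames x 14 dims (e.g. F0 + energy + 12 MFCCs)
rng = np.random.default_rng(0)
utterance = rng.normal(size=(200, 14))
features = global_stats_features(utterance)
print(features.shape)  # (56,) — 4 statistics x 14 dimensions
```

The paper's contribution is to replace this kind of fixed global-statistics summary with a model-based parametric feature set; the sketch above only illustrates the conventional approach it improves on.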
Citations: 12
Journal: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding