
Latest publications from the 2012 IEEE Spoken Language Technology Workshop (SLT)

Transcription of multi-genre media archives using out-of-domain data
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424244
Peter Bell, M. Gales, P. Lanchantin, Xunying Liu, Yanhua Long, Steve Renals, P. Swietojanski, P. Woodland
We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging recognition task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative WER reductions of 15% over a PLP baseline, 9% over in-domain tandem features and 8% over the best out-of-domain tandem features.
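The tandem idea underlying MLAN can be pictured in a few lines: posteriors from a network trained on out-of-domain data are converted to log-posterior features and appended to the base acoustic features before the in-domain stage is trained on the result. The sketch below is a minimal numpy reconstruction of that first level; all names, shapes, and the flooring constant are illustrative assumptions, not the paper's code.

```python
import numpy as np

def log_posterior_features(posteriors, floor=1e-8):
    """Convert frame-level phone posteriors into log-posterior features."""
    return np.log(np.maximum(posteriors, floor))

def mlan_first_level(acoustic_feats, ood_posteriors):
    """Append out-of-domain (OOD) log posteriors to acoustic features.

    acoustic_feats : (T, D) array, e.g. PLP features
    ood_posteriors : (T, P) posteriors from a network trained out-of-domain
    returns        : (T, D + P) tandem input for the in-domain stage
    """
    return np.hstack([acoustic_feats, log_posterior_features(ood_posteriors)])

# Toy usage: 100 frames, 13-dim PLP, 40 phone classes (all invented shapes).
T, D, P = 100, 13, 40
plp = np.random.randn(T, D)
post = np.random.dirichlet(np.ones(P), size=T)   # each row sums to 1
tandem = mlan_first_level(plp, post)
assert tandem.shape == (T, D + P)
```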
Citations: 42
Localized detection of speech recognition errors
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424164
Svetlana Stoyanchev, Philipp Salletmayr, Jingbo Yang, Julia Hirschberg
We address the problem of localized error detection in Automatic Speech Recognition (ASR) output. Localized error detection seeks to identify which particular words in a user's utterance have been misrecognized. Identifying misrecognized words permits one to create targeted clarification strategies for spoken dialogue systems, allowing the system to ask clarification questions targeting the particular type of misrecognition, in contrast to the “please repeat/rephrase” strategies used in most current dialogue systems. We present results of machine learning experiments using ASR confidence scores together with prosodic and syntactic features to predict whether 1) an utterance contains an error, and 2) whether a word in a misrecognized utterance is misrecognized. We show that by adding syntactic features to the ASR features when predicting misrecognized utterances the F-measure improves by 13.3% compared to using ASR features alone. By adding syntactic and prosodic features when predicting misrecognized words F-measure improves by 40%.
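The word-level detection task described here reduces to a straightforward classification setup: one feature vector per recognized word, combining the ASR confidence score with prosodic and syntactic features, and a binary misrecognition label. A hedged sklearn sketch follows; the synthetic features, the logistic-regression learner, and the train/test split are illustrative assumptions rather than the paper's experimental pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 500
conf = rng.uniform(0, 1, (n, 1))                  # ASR word confidence score
pros = rng.normal(0, 1, (n, 2))                   # e.g. pitch / energy statistics
synt = rng.integers(0, 2, (n, 2)).astype(float)   # e.g. POS / parse indicators
X = np.hstack([conf, pros, synt])
# Synthetic labels: low-confidence words tend to be misrecognized (label 1).
y = (conf[:, 0] + 0.1 * rng.normal(size=n) < 0.5).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
print("F-measure:", f1_score(y[400:], clf.predict(X[400:])))
```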
Citations: 22
Frame-based phonotactic Language Identification
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424240
Kyu Jeong Han, Jason W. Pelecanos
This paper describes a frame-based phonotactic Language Identification (LID) system, which was used for the LID evaluation of the Robust Automatic Transcription of Speech (RATS) program by the Defense Advanced Research Projects Agency (DARPA). The proposed approach utilizes features derived from frame-level phone log-likelihoods from a phone recognizer. It is an attempt to capture not only phone sequence information but also short-term timing information for phone N-gram events, which is lacking in conventional phonotactic LID systems that simply count phone N-gram events. Based on this new method, we achieved 26% relative improvement in terms of Cavg for the RATS LID evaluation data compared to phone N-gram counts modeling. We also observed that it had a significant impact on score combination with our best acoustic system based on Mel-Frequency Cepstral Coefficients (MFCCs).
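The contrast the paper draws can be made concrete: conventional phonotactic LID counts phone N-gram events from a decoded sequence, while a frame-based view also retains how long each phone persists. The sketch below is an illustrative numpy reconstruction, not the RATS system; the per-frame argmax phone decision and the segment collapsing are simplifying assumptions.

```python
import numpy as np
from collections import Counter

def phone_bigrams_with_duration(frame_loglik):
    """frame_loglik: (T, P) frame-level phone log-likelihoods."""
    frames = frame_loglik.argmax(axis=1)      # hard best-phone decision per frame
    # Collapse runs of identical frames into (phone, duration) segments.
    segments, start = [], 0
    for t in range(1, len(frames) + 1):
        if t == len(frames) or frames[t] != frames[start]:
            segments.append((int(frames[start]), t - start))
            start = t
    phones = [p for p, _ in segments]
    # Conventional phonotactics keeps only these N-gram counts...
    bigrams = Counter(zip(phones, phones[1:]))
    # ...while the frame-based view also retains per-occurrence durations.
    return bigrams, segments

loglik = np.random.randn(200, 30)             # 200 frames, 30 phones (toy data)
bigrams, segments = phone_bigrams_with_duration(loglik)
```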
Citations: 7
MediaParl: Bilingual mixed language accented speech database
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424233
David Imseng, H. Bourlard, Holger Caesar, Philip N. Garner, G. Lecorvé, Alexandre Nanchen
MediaParl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. Therefore, the database contains data with high variability and is suitable to study multilingual, accented and non-native speech recognition as well as language identification and language switch detection. We also define monolingual and mixed language automatic speech recognition and language identification tasks and evaluate baseline systems. The database is publicly available for download.
Citations: 39
Deep-level acoustic-to-articulatory mapping for DBN-HMM based phone recognition
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424252
Leonardo Badino, Claudia Canevari, L. Fadiga, G. Metta
In this paper we experiment with methods based on Deep Belief Networks (DBNs) to recover measured articulatory data from speech acoustics. Our acoustic-to-articulatory mapping (AAM) processes go through multi-layered and hierarchical (i.e., deep) representations of the acoustic and the articulatory domains obtained through unsupervised learning of DBNs. The unsupervised learning of DBNs can serve two purposes: (i) pre-training of the Multi-layer Perceptrons that perform AAM; (ii) transformation of the articulatory domain that is recovered from acoustics through AAM. The recovered articulatory features are combined with MFCCs to compute phone posteriors for phone recognition. Tested on the MOCHA-TIMIT corpus, the recovered articulatory features, when combined with MFCCs, lead to up to a remarkable 16.6% relative phone error reduction w.r.t. a phone recognizer that only uses MFCCs.
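The AAM step itself is a regression problem: a deep network maps acoustic frames to measured articulatory trajectories, and the recovered trajectories are appended to the MFCCs. A minimal sklearn sketch follows; DBN pretraining is omitted, and the layer sizes, dimensionalities, and random data are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

T, D_ac, D_art = 300, 39, 14                  # frames, acoustic dim, articulatory dim
X_ac = np.random.randn(T, D_ac)               # acoustic frames (e.g. MFCCs + deltas)
Y_art = np.random.randn(T, D_art)             # measured articulatory trajectories

# Deep regression from acoustics to articulation (DBN pretraining omitted here).
aam = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=200)
aam.fit(X_ac, Y_art)
Y_hat = aam.predict(X_ac)                     # recovered articulatory features

combined = np.hstack([X_ac, Y_hat])           # MFCCs + recovered features for recognition
```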
Citations: 19
POMDP-based Let's Go system for spoken dialog challenge
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424198
Sungjin Lee, M. Eskénazi
This paper describes a POMDP-based Let's Go system which incorporates belief tracking and dialog policy optimization into the dialog manager of the reference system for the Spoken Dialog Challenge (SDC). Since all components except the dialog manager were kept the same, a component-wise comparison can be performed to investigate the effect of belief tracking and dialog policy optimization on overall system performance. In addition, since unsupervised methods were adopted to learn all required models, reducing human labor and development time, the effectiveness of the unsupervised approaches relative to conventional supervised approaches can be investigated. The resulting system participated in the 2011 SDC and performed comparably to the base system, which had been enhanced from the reference system for the 2010 SDC. This demonstrates that the proposed method can rapidly produce an effective system with minimal human labor and expert knowledge.
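Belief tracking, the first ingredient the system adds, amounts to maintaining a distribution over slot values and updating it from each N-best list of recognized user acts. The update rule below is a simple illustrative scheme under stated assumptions, not the paper's learned model; the slot values and the `p_stay` discount are invented for the example.

```python
def update_belief(belief, nbest, p_stay=0.1):
    """belief: {value: prob}; nbest: list of (value, confidence) pairs."""
    new = {v: p * p_stay for v, p in belief.items()}  # residual mass for unobserved values
    for value, conf in nbest:
        new[value] = new.get(value, 0.0) + conf       # add evidence from the N-best list
    z = sum(new.values())
    return {v: p / z for v, p in new.items()}         # renormalize

belief = {"downtown": 0.5, "airport": 0.5}
belief = update_belief(belief, [("downtown", 0.6), ("shadyside", 0.2)])
print(belief)   # probability mass shifts toward "downtown"
```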
Citations: 23
The language-independent bottleneck features
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424246
Karel Veselý, M. Karafiát, F. Grézl, M. Janda, E. Egorova
In this paper we present a novel language-independent bottleneck (BN) feature extraction framework. In our experiments we used a Multilingual Artificial Neural Network (ANN), where each language is modelled by a separate output layer, while all the hidden layers jointly model the variability of all the source languages. The key idea is that the entire ANN is trained on all the languages simultaneously, so the BN features are not biased towards any of the languages. For exactly this reason, the final BN features are considered language independent. In experiments with the GlobalPhone database, we show that multilingual BN features consistently outperform monolingual BN features. Cross-lingual generalization is also evaluated, where we train on 5 source languages and test on 3 other languages. The results show that the ANN can produce very good BN features even for unseen languages, in some cases better than if the ANN were trained on the target language only.
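The architecture described is easy to picture: shared hidden layers ending in a narrow bottleneck, with one softmax output layer per language, trained jointly on all languages. The PyTorch sketch below is a hedged reconstruction; the layer sizes, activation choice, and per-language phone-set sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultilingualBN(nn.Module):
    def __init__(self, in_dim, bn_dim, phone_counts):
        super().__init__()
        self.shared = nn.Sequential(           # hidden layers trained jointly on all languages
            nn.Linear(in_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim),           # the narrow bottleneck layer
        )
        self.heads = nn.ModuleList(            # one output layer per language
            [nn.Linear(bn_dim, n) for n in phone_counts]
        )

    def forward(self, x, lang):
        bn = self.shared(x)                    # language-independent BN features
        return self.heads[lang](bn), bn

# 5 source languages with invented phone-set sizes; 440 = stacked input frames (assumed).
net = MultilingualBN(in_dim=440, bn_dim=30, phone_counts=[42, 38, 45, 40, 36])
logits, bn_feats = net(torch.randn(8, 440), lang=2)
```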
Citations: 216
Comparison of adaptation methods for GMM-SVM based speech emotion recognition
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424234
Jianbo Jiang, Zhiyong Wu, Mingxing Xu, Jia Jia, Lianhong Cai
The required length of the utterance is one of the key factors affecting the performance of automatic emotion recognition. To improve the accuracy of emotion classification, adaptation algorithms that can operate on short utterances are essential. To this end, this paper compares two classical model adaptation methods, maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR), in GMM-SVM based emotion recognition, and investigates which method performs better for different enrollment utterance lengths. Experimental results show that MLLR adaptation performs better for very short enrollment utterances (shorter than 2 s), while MAP adaptation is more effective for longer utterances.
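For reference, MAP adaptation of a GMM component mean follows the standard relevance-factor form mu_map = (tau * mu_prior + n_k * xbar_k) / (tau + n_k), where n_k is the soft count of adaptation frames assigned to the component and xbar_k their weighted mean. The numpy sketch below works this through on toy data; tau = 16 and the uniform responsibilities are assumptions. It also hints at why the paper's finding is plausible: with little data the MAP estimate barely moves from the prior, whereas MLLR pools all components through one shared affine transform.

```python
import numpy as np

def map_adapt_mean(mu_prior, X, gamma, tau=16.0):
    """mu_prior: (D,) prior mean; X: (N, D) frames; gamma: (N,) responsibilities."""
    n_k = gamma.sum()                              # soft count for this component
    xbar = (gamma[:, None] * X).sum(axis=0) / n_k  # weighted mean of adaptation data
    return (tau * mu_prior + n_k * xbar) / (tau + n_k)

mu_prior = np.zeros(13)
X = np.random.randn(50, 13) + 1.0                  # a short enrollment utterance (toy)
gamma = np.ones(50)                                # toy responsibilities
mu_map = map_adapt_mean(mu_prior, X, gamma)
# With few frames (small n_k), mu_map stays close to the prior, so per-component
# MAP adaptation moves slowly on short utterances.
```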
Citations: 3
Use of kernel deep convex networks and end-to-end learning for spoken language understanding
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424224
L. Deng, Gökhan Tür, Xiaodong He, Dilek Z. Hakkani-Tür
We present our recent and ongoing work on applying deep learning techniques to spoken language understanding (SLU) problems. The previously developed deep convex network (DCN) is extended to its kernel version (K-DCN) where the number of hidden units in each DCN layer approaches infinity using the kernel trick. We report experimental results demonstrating dramatic error reduction achieved by the K-DCN over both the Boosting-based baseline and the DCN on a domain classification task of SLU, especially when a highly correlated set of features extracted from search query click logs are used. Not only can DCN and K-DCN be used as a domain or intent classifier for SLU, they can also be used as local, discriminative feature extractors for the slot filling task of SLU. The interface of K-DCN to slot filling systems via the softmax function is presented. Finally, we outline an end-to-end learning strategy for training the softmax parameters (and potentially all DCN and K-DCN parameters) where the learning objective can take any performance measure (e.g. the F-measure) for the full SLU system.
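The stacking principle behind DCN, which K-DCN kernelizes, can be sketched compactly: each module regresses the targets from the raw features concatenated with the previous module's predictions. Using sklearn's KernelRidge as the per-module learner below is an assumption for illustration; the paper's own kernel construction and the softmax interface to slot filling are not reproduced.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def kdcn_fit(X, Y, n_layers=3):
    """Stack kernel regressions; each layer sees raw X plus the previous predictions."""
    layers, feats = [], X
    for _ in range(n_layers):
        m = KernelRidge(kernel="rbf", alpha=1.0).fit(feats, Y)
        layers.append(m)
        feats = np.hstack([X, m.predict(feats)])   # concatenate predictions onto raw input
    return layers

def kdcn_predict(layers, X):
    feats, pred = X, None
    for m in layers:
        pred = m.predict(feats)
        feats = np.hstack([X, pred])
    return pred

X = np.random.randn(100, 20)                   # toy utterance features
Y = np.eye(4)[np.random.randint(0, 4, 100)]    # one-hot domain labels
domain = kdcn_predict(kdcn_fit(X, Y), X).argmax(axis=1)
```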
Citations: 114
N-best error simulation for training spoken dialogue systems
Pub Date: 2012-12-01 DOI: 10.1109/SLT.2012.6424194
Blaise Thomson, Milica Gasic, Matthew Henderson, P. Tsiakoulis, S. Young
A recent trend in spoken dialogue research is the use of reinforcement learning to train dialogue systems in a simulated environment. Past researchers have shown that the types of errors that are simulated can have a significant effect on simulated dialogue performance. Since modern systems typically receive an N-best list of possible user utterances, it is important to be able to simulate a full N-best list of hypotheses. This paper presents a new method for simulating such errors based on logistic regression, as well as a new method for simulating the structure of N-best lists of semantics and their probabilities, based on the Dirichlet distribution. Off-line evaluations show that the new Dirichlet model results in a much closer match to the receiver operating characteristics (ROC) of the live data. Experiments also show that the logistic model gives confusions that are closer to the type of confusions observed in live situations. The hope is that these new error models will be able to improve the resulting performance of trained dialogue systems.
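The Dirichlet idea can be illustrated directly: for each true user act, a probability vector over the true hypothesis and a set of confusions is drawn from a Dirichlet distribution, yielding simulated N-best lists whose score structure varies from turn to turn. The sketch below is a minimal reconstruction; the confusion set, dialog-act strings, and concentration parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_nbest(true_act, confusions, alpha, n_best=3):
    """Return a simulated N-best list of (hypothesis, probability) pairs."""
    hyps = [true_act] + list(confusions)
    probs = rng.dirichlet(alpha)               # fresh confusion draw per simulated turn
    order = np.argsort(-probs)[:n_best]        # keep the N most probable hypotheses
    return [(hyps[i], float(probs[i])) for i in order]

nbest = simulate_nbest(
    "inform(area=centre)",
    ["inform(area=west)", "inform(food=chinese)", "null()"],
    alpha=np.array([8.0, 1.0, 1.0, 0.5]),      # concentration skewed toward the truth
)
print(nbest)
```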
Citations: 22