
Latest publications from the 2016 IEEE Spoken Language Technology Workshop (SLT)

Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846319
Leimin Tian, Johanna D. Moore, Catherine Lai
Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
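As a concrete illustration of the three fusion strategies, the following minimal PyTorch sketch contrasts Feature-Level, Decision-Level, and a Hierarchical fusion in which utterance-level acoustic functionals enter above a lexical LSTM. All dimensions, layer choices, and the emotion inventory are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the three fusion strategies discussed above.
# Dimensions and layers are illustrative assumptions only.
import torch
import torch.nn as nn

ACOUSTIC_DIM, LEXICAL_DIM, HIDDEN, N_EMOTIONS = 88, 300, 64, 4

class FeatureLevelFusion(nn.Module):
    """Concatenate per-utterance feature vectors before recognition."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Sequential(
            nn.Linear(ACOUSTIC_DIM + LEXICAL_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_EMOTIONS))
    def forward(self, acoustic, lexical):
        return self.clf(torch.cat([acoustic, lexical], dim=-1))

class DecisionLevelFusion(nn.Module):
    """Average the posteriors of two unimodal classifiers."""
    def __init__(self):
        super().__init__()
        self.ac = nn.Linear(ACOUSTIC_DIM, N_EMOTIONS)
        self.lx = nn.Linear(LEXICAL_DIM, N_EMOTIONS)
    def forward(self, acoustic, lexical):
        return 0.5 * (self.ac(acoustic).softmax(-1)
                      + self.lx(lexical).softmax(-1))

class HierarchicalFusion(nn.Module):
    """The local lexical sequence is encoded first; global acoustic
    functionals enter at a higher level of the hierarchy."""
    def __init__(self):
        super().__init__()
        self.lower = nn.LSTM(LEXICAL_DIM, HIDDEN, batch_first=True)
        self.upper = nn.Sequential(
            nn.Linear(HIDDEN + ACOUSTIC_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_EMOTIONS))
    def forward(self, acoustic, lexical_seq):
        _, (h, _) = self.lower(lexical_seq)        # h: (1, B, HIDDEN)
        return self.upper(torch.cat([h[-1], acoustic], dim=-1))

if __name__ == "__main__":
    ac = torch.randn(8, ACOUSTIC_DIM)              # utterance-level functionals
    lx_seq = torch.randn(8, 20, LEXICAL_DIM)       # word-embedding sequence
    fl = FeatureLevelFusion()(ac, lx_seq.mean(dim=1))
    dl = DecisionLevelFusion()(ac, lx_seq.mean(dim=1))
    hf = HierarchicalFusion()(ac, lx_seq)
    print(fl.shape, dl.shape, hf.shape)            # all torch.Size([8, 4])
```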
Citations: 45
Improved prediction of the accent gap between speakers of English for individual-based clustering of World Englishes
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846255
Fumiya Shiozawa, D. Saito, N. Minematsu
The term "World Englishes" describes the current state of English, and one of its main characteristics is a large diversity of pronunciation, called accents. In our previous studies, we developed several techniques to realize effective clustering and visualization of this diversity. To this end, the accent gap between two speakers has to be quantified independently of extra-linguistic factors such as age and gender. To realize this, a unique representation of speech, called speech structure, which is theoretically invariant against these factors, was applied to represent pronunciation. In the current study, by controlling the degree of invariance, we attempt to improve accent gap prediction. Two techniques are tested: DNN-based model-free estimation of divergence and multi-stream speech structures. In the former, instead of estimating the separability between two speech events based on model assumptions, DNN-based class posteriors are utilized for estimation. In the latter, by deriving one speech structure for each sub-space of acoustic features, constrained invariance is realized. Our proposals are tested in terms of the correlation between reference accent gaps and the predicted, quantified gaps. Experiments show that the correlation is improved from 0.718 to 0.730.
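To make the speech-structure idea concrete, the sketch below builds, for each speaker, a matrix of pairwise divergences between models of that speaker's speech events and measures the accent gap as a distance between two such matrices. Modelling events as diagonal Gaussians and using the Bhattacharyya distance are assumptions made here for illustration, not the paper's exact estimator.

```python
# Sketch: represent each speaker by the matrix of pairwise divergences
# between his/her speech events, then compare two speakers via their
# structure matrices. Gaussian events + Bhattacharyya distance are
# illustrative assumptions.
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians."""
    v = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term2 = 0.5 * np.sum(np.log(v / np.sqrt(var1 * var2)))
    return term1 + term2

def speech_structure(events):
    """events: list of (mean, variance) pairs, one per speech event
    (e.g. one per vowel). Returns the pairwise distance matrix."""
    n = len(events)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = bhattacharyya_diag(*events[i], *events[j])
    return S

def accent_gap(struct_a, struct_b):
    """Euclidean distance between the upper triangles of two structures."""
    iu = np.triu_indices_from(struct_a, k=1)
    return np.linalg.norm(struct_a[iu] - struct_b[iu])

rng = np.random.default_rng(0)
spk = lambda: [(rng.normal(size=13), rng.uniform(0.5, 2.0, size=13))
               for _ in range(5)]                  # 5 events, 13-dim features
print(accent_gap(speech_structure(spk()), speech_structure(spk())))
```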
Citations: 1
A log-linear weighting approach in the Word2vec space for spoken language understanding
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846289
Killian Janod, Mohamed Morchid, Richard Dufour, G. Linarès
This paper proposes an original method which integrates contextual information of words into Word2vec neural networks that learn from words and their respective context windows. In the classical word embedding approach, context windows are represented as bags-of-words, i.e. every word in the context is treated equally. A log-linear weighting approach modeling the continuous context is proposed in our model to take into account the relative position of words in the surrounding context of the word. Quality improvements achieved by this method are shown on the Semantic-Syntactic Word Relationship test and on a real application framework involving a theme identification task on human dialogues. The promising gains of our adapted Word2vec model, 7 points for the Skip-gram approach and 5 points for CBOW, demonstrate that the proposed models are a step forward for word and document representation.
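The following sketch contrasts the flat bag-of-words context average with a position-dependent weighting. The particular log-linear form exp(-lam * |d|) over the relative position d is an assumption made here for illustration, not the paper's actual weighting.

```python
# Sketch: weight context words by relative position instead of treating
# the window as a flat bag-of-words. The exp(-lam * |d|) form is an
# illustrative assumption.
import numpy as np

def loglinear_weights(window, lam=0.5):
    """Weight for each relative position -window..-1, 1..window."""
    offsets = [d for d in range(-window, window + 1) if d != 0]
    w = np.exp(-lam * np.abs(offsets))       # log-weight is linear in |d|
    return offsets, w / w.sum()

def context_vector(embeddings, sentence, center, window=2, lam=0.5):
    """Weighted average of the context-word embeddings around `center`."""
    offsets, w = loglinear_weights(window, lam)
    vec = np.zeros(embeddings.shape[1])
    for d, wd in zip(offsets, w):
        pos = center + d
        if 0 <= pos < len(sentence):
            vec += wd * embeddings[sentence[pos]]
    return vec

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 50))               # toy vocabulary of 100 words
sent = [3, 17, 42, 8, 99]                    # word ids of one sentence
print(context_vector(E, sent, center=2).shape)   # (50,)
```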
Citations: 3
Improving multi-stream classification by mapping sequence-embedding in a high dimensional space
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846269
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès
Most Natural and Spoken Language Processing tasks now employ Neural Networks (NNs), allowing them to reach impressive performance. Embedding features allow NLP systems to represent input vectors in a latent space and to improve the observed performance. In this context, Recurrent Neural Network (RNN) based architectures such as Long Short-Term Memory (LSTM) are well known for their capacity to encode sequential data into a non-sequential hidden vector representation, called a sequence embedding. In this paper, we propose an LSTM-based multi-stream sequence embedding in order to encode parallel sequences into a single non-sequential latent representation vector. We then propose to map this embedding representation into a high-dimensional space using a Support Vector Machine (SVM) in order to classify the multi-stream sequences by finding an optimal hyperplane. Multi-stream sequence embedding allows the SVM classifier to profit more efficiently from the information carried by both parallel streams and longer sequences. The system achieved the best performance in a multi-stream sequence classification task, with a gain of 9 points in error rate compared to an SVM trained on the original input sequences.
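A minimal sketch of the pipeline as described: an LSTM encodes time-aligned parallel streams into a single final hidden state (the sequence embedding), and an SVM with an RBF kernel separates the embeddings in a high-dimensional space. Frame-wise concatenation of the streams, the toy dimensions, and the untrained encoder are simplifying assumptions.

```python
# Sketch: multi-stream sequence embedding via LSTM, then SVM
# classification of the embeddings. The encoder is untrained here;
# in the paper it is trained on the task.
import torch
import torch.nn as nn
import numpy as np
from sklearn.svm import SVC

STREAM_DIMS, HIDDEN = (10, 10), 32           # two parallel streams

class MultiStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(sum(STREAM_DIMS), HIDDEN, batch_first=True)
    def forward(self, streams):
        # streams: list of time-aligned (B, T, dim) tensors
        x = torch.cat(streams, dim=-1)
        _, (h, _) = self.lstm(x)
        return h[-1]                          # (B, HIDDEN) sequence embedding

encoder = MultiStreamEncoder().eval()
with torch.no_grad():
    s1 = torch.randn(64, 15, STREAM_DIMS[0])
    s2 = torch.randn(64, 15, STREAM_DIMS[1])
    emb = encoder([s1, s2]).numpy()

labels = np.random.randint(0, 2, size=64)    # toy binary labels
svm = SVC(kernel="rbf").fit(emb, labels)     # finds the separating hyperplane
print(svm.score(emb, labels))
```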
Citations: 2
Automatic optimization of data perturbation distributions for multi-style training in speech recognition
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846240
Mortaza Doulaty, R. Rose, O. Siohan
Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances are significantly different from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted and potentially mismatched training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels for a given set of perturbation types that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set. Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker-related variation such as frequency and temporal warping. The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study is performed to evaluate the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
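The sketch below illustrates the core idea: estimate an empirical distribution of perturbation levels from a small target-domain sample, then sample a level for each clean training utterance. Using SNR as the sole perturbation dimension and a simple histogram estimator are assumptions made here for illustration.

```python
# Sketch: draw MTR perturbation levels from an empirical distribution
# estimated on a small target-domain sample. SNR-only perturbation and
# the histogram estimator are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SNRs measured on a handful of target-domain utterances.
target_snrs = rng.normal(loc=12.0, scale=4.0, size=50)

# Estimate an empirical distribution over discrete perturbation levels.
levels = np.arange(0, 31, 5)                       # candidate SNR levels (dB)
bins = np.concatenate([levels - 2.5, [levels[-1] + 2.5]])
counts, _ = np.histogram(target_snrs, bins=bins)
p_levels = counts / counts.sum()
print(dict(zip(levels.tolist(), np.round(p_levels, 2))))

# Sample a perturbation level for each clean training utterance.
n_train = 10
sampled = rng.choice(levels, size=n_train, p=p_levels)
for utt_id, snr in enumerate(sampled):
    # a real pipeline would now mix noise into the utterance at this SNR
    print(f"utt{utt_id:03d}: mix noise at {snr} dB SNR")
```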
Citations: 9
Unsupervised context learning for speech recognition
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846302
A. Michaely, M. Ghodsi, Zelin Wu, Justin Scheiner, Petar S. Aleksic
It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [1, 2, 3, 4, 5]. Contextual information can be used to simplify the beam search and improve recognition accuracy. Types of useful contextual information can include the name of the application the user is in, the contents of the user's phone screen, the user's location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or limited amounts of training data. In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual biasing models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant for a particular context are automatically selected as well as how an optimal size of a final contextual model is chosen. Our experimental results show great accuracy improvements for several types of context.
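A minimal sketch of one way to select context-relevant n-grams: score each n-gram by how much more frequent it is under the context than in background data, and keep the top scorers as a small biasing model. The log-ratio relevance score, add-one smoothing, and the cutoff are assumptions made here for illustration, not the paper's actual selection criterion.

```python
# Sketch: pick n-grams for a contextual biasing model by comparing
# context-conditioned frequency against a background corpus.
from collections import Counter
import math

def ngrams(tokens, n=2):
    return zip(*(tokens[i:] for i in range(n)))

def select_biasing_ngrams(context_texts, background_texts, n=2, top_k=3):
    ctx = Counter(g for t in context_texts for g in ngrams(t.split(), n))
    bg = Counter(g for t in background_texts for g in ngrams(t.split(), n))
    c_tot, b_tot = sum(ctx.values()), sum(bg.values())
    def score(g):
        p_ctx = ctx[g] / c_tot
        p_bg = (bg[g] + 1) / (b_tot + len(bg))    # add-one smoothing
        return math.log(p_ctx / p_bg)
    return sorted(ctx, key=score, reverse=True)[:top_k]

context = ["play the next song", "play my running playlist",
           "skip this song"]
background = ["what is the weather", "set a timer", "play the news",
              "call mom"]
print(select_biasing_ngrams(context, background))
# e.g. [('the', 'next'), ('next', 'song'), ('play', 'my')]
```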
Citations: 10
Deep learning with maximal figure-of-merit cost to advance multi-label speech attribute detection
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846308
Ivan Kukanov, Ville Hautamäki, S. Siniscalchi, Kehuang Li
In this work, we are interested in boosting speech attribute detection by formulating it as a multi-label classification task, and deep neural networks (DNNs) are used to design speech attribute detectors. A straightforward way to tackle the speech attribute detection task is to estimate DNN parameters using the mean squared error (MSE) loss function and employ a sigmoid function in the DNN output nodes. A more principled way, however, is to incorporate the micro-F1 measure, a widely used metric in multi-label classification, into the DNN loss function to directly improve the metric of interest at training time. Micro-F1 is not differentiable, yet we overcome such a problem by casting our task under the maximal figure-of-merit (MFoM) learning framework. The results demonstrate that our MFoM approach consistently outperforms the baseline systems.
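Since micro-F1 is not differentiable, a smooth surrogate is needed for gradient training. The sketch below uses a common soft-F1 relaxation in PyTorch, where sigmoid scores stand in for hard 0/1 decisions; it is an illustrative stand-in for the MFoM formulation, not its exact smoothed error counts.

```python
# Sketch: optimize a smooth surrogate of micro-F1 by letting sigmoid
# scores play the role of hard decisions, making the true/false
# positive counts differentiable.
import torch

def soft_micro_f1_loss(logits, targets, eps=1e-8):
    """logits, targets: (batch, n_labels); targets in {0, 1}."""
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()             # soft true positives
    fp = (probs * (1 - targets)).sum()       # soft false positives
    fn = ((1 - probs) * targets).sum()       # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)   # micro-F1 over all labels
    return 1.0 - f1                          # minimize (1 - F1)

# Toy multi-label attribute detector trained against the surrogate.
torch.manual_seed(0)
x = torch.randn(32, 40)                      # 40-dim acoustic features
y = (torch.rand(32, 6) > 0.7).float()        # 6 speech attributes
model = torch.nn.Linear(40, 6)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(100):
    opt.zero_grad()
    loss = soft_micro_f1_loss(model(x), y)
    loss.backward()
    opt.step()
print(f"final surrogate loss: {loss.item():.3f}")
```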
Citations: 8
An unsupervised vocabulary selection technique for Chinese automatic speech recognition
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846298
Yike Zhang, Pengyuan Zhang, Ta Li, Yonghong Yan
The vocabulary is a vital component of automatic speech recognition (ASR) systems. For a specific Chinese speech recognition task, using a large general vocabulary not only leads to a much longer decoding time, but also hurts recognition accuracy. In this paper, we propose an unsupervised algorithm to select task-specific words from a large general vocabulary. The out-of-vocabulary (OOV) rate is a measure of vocabulary quality and is related to recognition accuracy. However, it is hard to compute the OOV rate for a Chinese vocabulary, since OOVs are often segmented into single Chinese characters and most Chinese vocabularies contain all the single Chinese characters. To deal with this problem, we propose a novel method to estimate the OOV rate of Chinese vocabularies. In experiments, we found that our estimated OOV rate is related to the character error rate (CER) of recognition. Compared to the general vocabulary and a frequency-based vocabulary selection method, our proposed vocabulary selection method provided both the lowest OOV rate and the lowest CER on two Chinese conversational telephone speech (CTS) evaluation sets. In addition, our proposed method significantly reduced the size of the language model (LM) and the corresponding weighted finite-state transducer (WFST) network, which led to more efficient decoding.
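The sketch below shows one plausible form of such an estimate: segment reference text into words and count tokens that are missing from the candidate vocabulary and would therefore fall back to single characters. This token-level ratio is an assumption made here for illustration, not necessarily the paper's exact estimator.

```python
# Sketch: estimate the OOV rate of a candidate Chinese vocabulary from
# word-segmented reference text. Multi-character words absent from the
# vocabulary count as OOV even though their individual characters are
# in-vocabulary.
from collections import Counter

def estimated_oov_rate(segmented_corpus, vocab):
    """segmented_corpus: iterable of word-segmented sentences (lists)."""
    counts = Counter(w for sent in segmented_corpus for w in sent)
    total = sum(counts.values())
    oov = sum(c for w, c in counts.items()
              if w not in vocab and len(w) > 1)
    return oov / total

corpus = [["我们", "喜欢", "语音", "识别"],
          ["语音", "识别", "系统", "很", "有用"]]
vocab = {"我们", "喜欢", "语音", "系统", "很", "有用"}   # "识别" missing
print(f"estimated OOV rate: {estimated_oov_rate(corpus, vocab):.2f}")
# 2 OOV tokens out of 9 -> 0.22
```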
Citations: 2
Semantic model for fast tagging of word lattices
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846295
L. Velikovich
This paper introduces a semantic tagger that inserts tags into a word lattice, such as one produced by a real-time large-vocabulary speech recognition system. Benefits of such a tagger include the ability to rescore speech recognition hypotheses based on this metadata, as well as providing rich annotations to clients downstream. We focus on the domain of spoken search queries and voice commands, which can be useful for building an intelligent assistant. We explore a method to distill a pre-existing very large named entity disambiguation (NED) model into a lightweight tagger. This is accomplished by constructing a joint distribution of tagged n-grams from a supervised training corpus, then deriving a conditional distribution for a given lattice. With 300 tagging categories, the tagger achieves a precision of 88.2% and recall of 93.1% on 1-best paths in speech recognition lattices with 2.8 ms median latency.
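A minimal sketch of the distillation step as described: accumulate joint (n-gram, tag) counts from a supervised corpus, derive the conditional P(tag | n-gram), and tag lattice word sequences whose posterior clears a threshold. The 0.8 cutoff, the convention of crediting the first word's tag to its n-gram, and the data layout are illustrative assumptions.

```python
# Sketch: joint counts of (n-gram, tag) from supervised data, then a
# conditional tag distribution used to annotate lattice paths.
from collections import defaultdict

joint = defaultdict(lambda: defaultdict(int))   # joint[ngram][tag] = count

def train(tagged_corpus, n=2):
    """tagged_corpus: list of (words, tags) with one tag per word.
    Each n-gram is credited with the tag of its first word."""
    for words, tags in tagged_corpus:
        for i in range(len(words) - n + 1):
            joint[tuple(words[i:i + n])][tags[i]] += 1

def tag_posterior(ngram):
    counts = joint.get(tuple(ngram), {})
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

train([(["call", "john", "smith"], ["O", "PERSON", "PERSON"]),
       (["call", "john", "now"],   ["O", "PERSON", "O"])])

# Tag a path from a word lattice: accept a tag if P(tag | ngram) > 0.8.
for ngram in [("call", "john"), ("john", "now")]:
    post = tag_posterior(ngram)
    best = max(post, key=post.get) if post else None
    if best and post[best] > 0.8:
        print(ngram, "->", best, f"({post[best]:.2f})")
```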
Citations: 7
Environmentally robust audio-visual speaker identification
Pub Date: 2016-12-01 DOI: 10.1109/SLT.2016.7846282
Lea Schönherr, Dennis Orth, M. Heckmann, D. Kolossa
To improve the accuracy of audio-visual speaker identification, we propose a new approach which achieves an optimal combination of the different modalities at the score level. We use the i-vector method for the acoustic and the local binary pattern (LBP) for the visual speaker recognition. For the input data of both modalities, multiple confidence measures are utilized to calculate an optimal weight for the fusion. Oracle weights are chosen in such a way as to maximize the difference between the score of the genuine speaker and that of the speaker with the best competing score. Based on these oracle weights, a mapping function for weight estimation is learned. To test the approach, various combinations of noise levels for the acoustic and visual data are considered. We show that the weighted multimodal identification is far less influenced by the presence of noise or distortions in acoustic or visual observations in comparison to an unweighted combination.
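A minimal numpy sketch of the scheme: fuse scores as s = w * s_audio + (1 - w) * s_video, pick the oracle w that maximizes the genuine-versus-best-competitor margin, and learn a mapping from per-trial confidence measures to w. The grid search and the linear-regression mapper are stand-ins chosen here for illustration.

```python
# Sketch: oracle fusion weights maximizing the identification margin,
# plus a learned confidence -> weight mapping for test time.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def oracle_weight(s_audio, s_video, genuine, grid=np.linspace(0, 1, 101)):
    """Scores are per-enrolled-speaker vectors; `genuine` is the index
    of the true speaker. Returns the margin-maximizing weight."""
    best_w, best_margin = 0.0, -np.inf
    for w in grid:
        fused = w * s_audio + (1 - w) * s_video
        competitors = np.delete(fused, genuine)
        margin = fused[genuine] - competitors.max()
        if margin > best_margin:
            best_w, best_margin = w, margin
    return best_w

# Toy data: 30 identification trials over 10 enrolled speakers, with
# per-trial acoustic/visual confidence measures (e.g. estimated SNRs).
n_trials, n_spk = 30, 10
conf = rng.uniform(0, 1, size=(n_trials, 2))
weights = []
for c_a, c_v in conf:
    s_a = rng.normal(size=n_spk) + 3 * c_a * (np.arange(n_spk) == 0)
    s_v = rng.normal(size=n_spk) + 3 * c_v * (np.arange(n_spk) == 0)
    weights.append(oracle_weight(s_a, s_v, genuine=0))

# Learn the confidence -> weight mapping used at test time.
mapper = LinearRegression().fit(conf, weights)
print("predicted w for clean audio, noisy video:",
      mapper.predict([[0.9, 0.2]])[0])
```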
Citations: 9