
Latest publications from the 2016 IEEE Spoken Language Technology Workshop (SLT)

Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846319
Leimin Tian, Johanna D. Moore, Catherine Lai
Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
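As a rough illustration of the fusion idea, a minimal PyTorch sketch is shown below: frame-level acoustic features pass through a lower recurrent layer, and the more abstract, utterance-level lexical features are injected at a higher layer. All dimensions, layer choices, and names are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of hierarchical (higher-level) fusion, assuming PyTorch.
# Dimensions and layer choices are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, acoustic_dim=40, lexical_dim=100, hidden=64, n_emotions=4):
        super().__init__()
        # Lower level: model the frame-level acoustic sequence.
        self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        # Higher level: fuse the more abstract, utterance-level lexical
        # features with the summarized acoustic representation.
        self.fusion = nn.Sequential(
            nn.Linear(hidden + lexical_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, acoustic_frames, lexical_vector):
        # acoustic_frames: (batch, time, acoustic_dim)
        # lexical_vector:  (batch, lexical_dim), e.g. averaged word embeddings
        _, (h_n, _) = self.acoustic_lstm(acoustic_frames)
        fused = torch.cat([h_n[-1], lexical_vector], dim=-1)
        return self.fusion(fused)  # emotion logits

model = HierarchicalFusion()
logits = model(torch.randn(2, 150, 40), torch.randn(2, 100))
```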
Citations: 45
Automated structure discovery and parameter tuning of neural network language model based on evolution strategy
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846334
Tomohiro Tanaka, Takafumi Moriya, T. Shinozaki, Shinji Watanabe, Takaaki Hori, Kevin Duh
Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation evolution strategy (CMA-ES), which has demonstrated robustness in other black-box hyper-parameter optimization problems. By flexibly allowing optimization of various meta-parameters, including layer-wise unit types, our method automatically finds a configuration that gives improved recognition performance. Further, by using a Pareto-based multi-objective CMA-ES, both WER and computational time were reduced jointly: after 10 generations, relative WER and decoding-time reductions were 4.1% and 22.7% respectively, compared to an initial baseline system whose WER was 8.7%.
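For intuition, here is a toy ask/tell evolution-strategy loop in the spirit of CMA-ES, simplified to an isotropic Gaussian with no covariance adaptation; evaluate() is a hypothetical stand-in for training a language model with the sampled meta-parameters and returning its validation WER.

```python
# Toy ask/tell evolution-strategy loop in the spirit of CMA-ES, assuming a
# simplified isotropic Gaussian (no covariance matrix adaptation).
# evaluate() is a hypothetical stand-in for "train an LSTM LM with these
# meta-parameters and return validation WER"; real evaluations are costly.
import numpy as np

def evaluate(x):
    # Hypothetical objective with an unknown optimum standing in for WER.
    target = np.array([0.3, -1.2, 0.8])
    return float(np.sum((x - target) ** 2))

rng = np.random.default_rng(0)
mean, sigma, popsize, elite = np.zeros(3), 1.0, 12, 4
for generation in range(30):
    candidates = mean + sigma * rng.standard_normal((popsize, 3))  # "ask"
    scores = np.array([evaluate(c) for c in candidates])
    best = candidates[np.argsort(scores)[:elite]]                  # "tell"
    mean = best.mean(axis=0)   # move the search distribution toward the elite
    sigma *= 0.95              # crude step-size decay
print("best meta-parameters found:", mean)
```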
Citations: 16
Unsupervised context learning for speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846302
A. Michaely, M. Ghodsi, Zelin Wu, Justin Scheiner, Petar S. Aleksic
It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [1, 2, 3, 4, 5]. Contextual information can be used to simplify the beam search and improve recognition accuracy. Types of useful contextual information can include the name of the application the user is in, the contents of the user's phone screen, the user's location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or limited amounts of training data. In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual biasing models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant for a particular context are automatically selected as well as how an optimal size of a final contextual model is chosen. Our experimental results show great accuracy improvements for several types of context.
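One simple way to realize the n-gram selection step is a relative-frequency ratio test between in-context and background corpora; the sketch below is an assumption-laden illustration, not the paper's exact criterion.

```python
# Sketch of selecting n-grams characteristic of one context, assuming a
# simple relative-frequency ratio test (not the paper's exact criterion).
from collections import Counter

def ngrams(tokens, n=2):
    return zip(*(tokens[i:] for i in range(n)))

def select_context_ngrams(context_sents, background_sents, ratio=2.0, min_count=2):
    ctx = Counter(g for s in context_sents for g in ngrams(s.split()))
    bg = Counter(g for s in background_sents for g in ngrams(s.split()))
    ctx_total = sum(ctx.values()) or 1
    bg_total = sum(bg.values()) or 1
    selected = []
    for gram, count in ctx.items():
        if count < min_count:
            continue
        rel_ctx = count / ctx_total
        rel_bg = (bg.get(gram, 0) + 1) / bg_total  # add-one smoothing
        if rel_ctx / rel_bg >= ratio:              # much more frequent in-context
            selected.append(gram)
    return selected

print(select_context_ngrams(
    ["play some music", "play the next song", "play some jazz"],
    ["what is the weather", "set an alarm", "call mom", "play outside"]))
# -> [('play', 'some')]
```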
Citations: 10
An unsupervised vocabulary selection technique for Chinese automatic speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846298
Yike Zhang, Pengyuan Zhang, Ta Li, Yonghong Yan
The vocabulary is a vital component of automatic speech recognition (ASR) systems. For a specific Chinese speech recognition task, using a large general vocabulary not only leads to much longer decoding times, but also hurts recognition accuracy. In this paper, we propose an unsupervised algorithm to select task-specific words from a large general vocabulary. The out-of-vocabulary (OOV) rate is a measure of vocabulary quality, and it is related to recognition accuracy. However, it is hard to compute the OOV rate for a Chinese vocabulary, since OOVs are often segmented into single Chinese characters and most Chinese vocabularies contain all the single Chinese characters. To deal with this problem, we propose a novel method to estimate the OOV rate of Chinese vocabularies. In experiments, we found that our estimated OOV rate is related to the character error rate (CER) of recognition. Our proposed vocabulary selection method achieved both the lowest OOV rate and the lowest CER on two Chinese conversational telephone speech (CTS) evaluation sets, compared to the general vocabulary and a frequency-based vocabulary selection method. In addition, our proposed method significantly reduced the size of the language model (LM) and the corresponding weighted finite state transducer (WFST) network, which led to more efficient decoding.
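A crude proxy for the estimation problem the abstract describes might segment text with forward maximum matching and treat runs of single-character fallbacks as suspected OOV words; the heuristic below is purely illustrative and not the paper's estimator.

```python
# Illustrative proxy for a Chinese OOV-rate estimate, assuming forward
# maximum matching and collapsing runs of two or more single-character
# fallbacks into one suspected OOV word. Not the paper's exact method.
def fmm_segment(text, vocab, max_len=4):
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in vocab:
                words.append(text[i:i + n])
                i += n
                break
    return words

def estimated_oov_rate(text, vocab):
    words = fmm_segment(text, vocab)
    tokens, oov, run = 0, 0, 0
    for w in words + [""]:            # sentinel flushes a trailing run
        if len(w) == 1:               # single-character fallback
            run += 1
            continue
        if run >= 2:                  # collapse the run into one suspected OOV
            tokens += 1
            oov += 1
        else:
            tokens += run             # a lone single char counts normally
        run = 0
        if w:
            tokens += 1
    return oov / max(tokens, 1)

vocab = {"语音", "识别", "系统"}
print(estimated_oov_rate("语音识别区块链系统", vocab))  # 0.25: one suspected OOV among four tokens
```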
Citations: 2
Semantic model for fast tagging of word lattices
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846295
L. Velikovich
This paper introduces a semantic tagger that inserts tags into a word lattice, such as one produced by a real-time large-vocabulary speech recognition system. Benefits of such a tagger include the ability to rescore speech recognition hypotheses based on this metadata, as well as providing rich annotations to clients downstream. We focus on the domain of spoken search queries and voice commands, which can be useful for building an intelligent assistant. We explore a method to distill a pre-existing very large named entity disambiguation (NED) model into a lightweight tagger. This is accomplished by constructing a joint distribution of tagged n-grams from a supervised training corpus, then deriving a conditional distribution for a given lattice. With 300 tagging categories, the tagger achieves a precision of 88.2% and recall of 93.1% on 1-best paths in speech recognition lattices, with 2.8 ms median latency.
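The joint-then-conditional construction can be illustrated with plain counts: accumulate (n-gram, tag) pairs from supervised data, then normalize per n-gram to get P(tag | n-gram). The training pairs below are hypothetical.

```python
# Sketch of distilling tag knowledge into n-gram statistics: a joint count
# over (n-gram, tag) pairs, conditioned on the n-gram to get P(tag | n-gram).
# Training pairs are hypothetical examples, not the paper's data.
from collections import Counter, defaultdict

training = [
    (("san", "francisco"), "CITY"),
    (("san", "francisco"), "CITY"),
    (("san", "francisco"), "CITY"),
    (("san", "francisco"), "SONG"),
    (("directions", "to"), "O"),
]

joint = Counter(training)
by_ngram = defaultdict(Counter)
for (gram, tag), c in joint.items():
    by_ngram[gram][tag] += c

def tag_posterior(gram):
    counts = by_ngram.get(gram)
    if not counts:
        return {}
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}

print(tag_posterior(("san", "francisco")))  # {'CITY': 0.75, 'SONG': 0.25}
```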
Citations: 7
Environmentally robust audio-visual speaker identification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846282
Lea Schönherr, Dennis Orth, M. Heckmann, D. Kolossa
To improve the accuracy of audio-visual speaker identification, we propose a new approach, which achieves an optimal combination of the different modalities on the score level. We use the i-vector method for the acoustics and the local binary pattern (LBP) for the visual speaker recognition. Regarding the input data of both modalities, multiple confidence measures are utilized to calculate an optimal weight for the fusion. Thus, oracle weights are chosen in such a way as to maximize the difference between the score of the genuine speaker and the person with the best competing score. Based on these oracle weights a mapping function for weight estimation is learned. To test the approach, various combinations of noise levels for the acoustic and visual data are considered. We show that the weighted multimodal identification is far less influenced by the presence of noise or distortions in acoustic or visual observations in comparison to an unweighted combination.
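The oracle-weight criterion is concrete enough to sketch directly: for a weighted sum of acoustic and visual scores, pick the weight that maximizes the margin between the genuine speaker and the best competitor (a mapping function is then trained to predict such weights from confidence measures). The scores below are made up.

```python
# Sketch of oracle weight selection for score-level audio-visual fusion:
# choose the weight maximizing the margin between the genuine speaker's
# fused score and the best competing fused score. Scores are hypothetical.
import numpy as np

def oracle_weight(acoustic_scores, visual_scores, genuine_idx, grid=101):
    best_w, best_margin = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, grid):
        fused = w * acoustic_scores + (1.0 - w) * visual_scores
        competitors = np.delete(fused, genuine_idx)
        margin = fused[genuine_idx] - competitors.max()
        if margin > best_margin:
            best_w, best_margin = w, margin
    return best_w, best_margin

acoustic = np.array([2.1, 1.9, 0.4])   # genuine speaker is index 0
visual = np.array([1.2, 0.3, 0.9])
w, margin = oracle_weight(acoustic, visual, genuine_idx=0)
print(f"oracle acoustic weight {w:.2f}, margin {margin:.2f}")
```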
Citations: 9
Deep learning with maximal figure-of-merit cost to advance multi-label speech attribute detection
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846308
Ivan Kukanov, Ville Hautamäki, S. Siniscalchi, Kehuang Li
In this work, we are interested in boosting speech attribute detection by formulating it as a multi-label classification task, and deep neural networks (DNNs) are used to design speech attribute detectors. A straightforward way to tackle the speech attribute detection task is to estimate DNN parameters using the mean squared error (MSE) loss function and employ a sigmoid function in the DNN output nodes. A more principled way is nonetheless to incorporate the micro-F1 measure, which is a widely used metric in the multi-label classification, into the DNN loss function to directly improve the metric of interest at training time. Micro-F1 is not differentiable, yet we overcome such a problem by casting our task under the maximal figure-of-merit (MFoM) learning framework. The results demonstrate that our MFoM approach consistently outperforms the baseline systems.
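A common differentiable relaxation in this spirit computes soft true/false positives from sigmoid outputs and minimizes one minus a soft micro-F1; the PyTorch sketch below follows that idea and is not necessarily the paper's exact MFoM formulation.

```python
# Sketch of a differentiable (soft) micro-F1 objective for multi-label
# attribute detection, assuming PyTorch. Soft true/false positives come
# from sigmoid outputs; this follows the spirit of MFoM-style training
# but is not the paper's exact formulation.
import torch

def soft_micro_f1_loss(logits, targets, eps=1e-8):
    # logits, targets: (batch, n_labels); targets in {0, 1}
    p = torch.sigmoid(logits)
    tp = (p * targets).sum()
    fp = (p * (1 - targets)).sum()
    fn = ((1 - p) * targets).sum()
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1  # minimize (1 - micro-F1)

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 2, (8, 10)).float()
loss = soft_micro_f1_loss(logits, targets)
loss.backward()  # gradients flow to the detector parameters
```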
Citations: 8
Parallel Long Short-Term Memory for multi-stream classification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846268
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès, R. Mori
Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome from a sequence of inputs. Recurrent hidden cells allow DNN-based models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks to manage long-term dependencies. Nevertheless, these RNNs process a single input stream in one (LSTM) or two (Bidirectional LSTM) directions. Most of the information available nowadays, however, comes from multi-stream or multimedia documents, and requires RNNs to process these streams synchronously during training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that processes multiple parallel synchronized input sequences in order to predict a common output. The proposed PLSTM method can be used for parallel sequence classification. The PLSTM approach is evaluated on an automatic telecast genre sequence classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms both baseline n-gram models and the state-of-the-art LSTM approach.
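A minimal PyTorch sketch of the PLSTM idea, assuming one LSTM per synchronized stream with the final hidden states concatenated into a shared classifier (stream count and dimensions are illustrative):

```python
# Sketch of a Parallel LSTM (PLSTM): one LSTM per synchronized input stream,
# final hidden states concatenated and fed to a shared classifier.
# Assuming PyTorch; dimensions and stream count are illustrative.
import torch
import torch.nn as nn

class ParallelLSTM(nn.Module):
    def __init__(self, n_streams=3, input_dim=20, hidden=32, n_classes=5):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.LSTM(input_dim, hidden, batch_first=True) for _ in range(n_streams)
        )
        self.classifier = nn.Linear(n_streams * hidden, n_classes)

    def forward(self, xs):
        # xs: list of n_streams tensors, each (batch, time, input_dim),
        # time-synchronized across streams
        finals = []
        for lstm, x in zip(self.streams, xs):
            _, (h_n, _) = lstm(x)
            finals.append(h_n[-1])
        return self.classifier(torch.cat(finals, dim=-1))

model = ParallelLSTM()
xs = [torch.randn(4, 50, 20) for _ in range(3)]
print(model(xs).shape)  # torch.Size([4, 5])
```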
Citations: 11
BBN technologies' OpenSAD system
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846238
Scott Novotney, D. Karakos, J. Silovský, R. Schwartz
We describe our submission to the NIST OpenSAD evaluation of speech activity detection of noisy audio generated by the DARPA RATS program. With frequent transmission degradation, channel interference and other noises added, simple energy thresholds do a poor job at SAD for this audio. The evaluation measured performance on both in-training and novel channels. Our approach used a system combination of feed-forward neural networks and bidirectional LSTM recurrent neural networks. System combination and unsupervised adaptation provided further gains on novel channels that lack training data. These improvements lead to a 26% relative improvement for novel channels over simple decoding. Our system resulted in the lowest error rate on the in-training channels and second on the out-of-training channels.
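At its simplest, score-level system combination for SAD averages the frame posteriors of the two models and thresholds the result; the numpy sketch below uses random stand-in posteriors and omits the calibration and unsupervised adaptation of the actual submission.

```python
# Sketch of frame-level score combination for speech activity detection:
# average the speech posteriors of a feed-forward DNN and a BLSTM, then
# threshold. Posteriors here are random stand-ins, not real model outputs.
import numpy as np

rng = np.random.default_rng(1)
dnn_posteriors = rng.random(1000)     # P(speech | frame) from model 1
blstm_posteriors = rng.random(1000)   # P(speech | frame) from model 2

combined = 0.5 * dnn_posteriors + 0.5 * blstm_posteriors
speech_frames = combined > 0.5        # per-frame speech/non-speech decision
print(f"{speech_frames.mean():.1%} of frames labeled speech")
```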
Citations: 2
Automated optimization of decoder hyper-parameters for online LVCSR
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846303
Akshay Chandrashekaran, Ian Lane
In this paper, we explore the use of automated hyper-parameter optimization techniques with scalarization of multiple objectives to find decoder hyper-parameters suitable for a given acoustic and language model in an LVCSR task. We compare manual optimization, random sampling, the tree of Parzen estimators, Bayesian Optimization, and a genetic algorithm to find a technique that yields better performance than manual optimization in a comparable number of hyper-parameter evaluations. We focus on a scalar combination of word error rate (WER), log of real time factor (logRTF), and peak memory usage, formulated using the augmented Tchebyscheff function (ATF), as the objective function for the automated techniques. For this task, with a constraint on the maximum number of objective evaluations, we find that the best automated optimization technique, Bayesian Optimization, outperforms manual optimization by 8% in terms of ATF. We find that memory usage was not a very useful distinguishing factor between different hyper-parameter settings, with trade-offs occurring between RTF and WER a majority of the time. We also try to optimize WER under a hard constraint of 0.1 on the real time factor. In this case, constrained Bayesian Optimization yields a model that improves by 2.7% over the best model obtained from manual optimization, using 60% as many evaluations.
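The augmented Tchebycheff scalarization itself is standard and easy to write down: a weighted max-term plus a small weighted sum of deviations from a reference point. The weights, reference values, and rho below are illustrative, not the paper's settings.

```python
# Sketch of the augmented Tchebycheff scalarization used to collapse
# (WER, log RTF, peak memory) into one objective: a max-term plus a
# small weighted sum. All constants here are illustrative assumptions.
import numpy as np

def augmented_tchebycheff(objectives, weights, reference, rho=0.05):
    # objectives, weights, reference: arrays of the same length
    diffs = weights * np.abs(objectives - reference)
    return diffs.max() + rho * diffs.sum()

candidate = np.array([12.4, np.log(0.08), 310.0])   # WER %, log RTF, MB
reference = np.array([8.0, np.log(0.05), 200.0])    # aspiration point
weights = np.array([1.0, 0.5, 0.01])
print(augmented_tchebycheff(candidate, weights, reference))
```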
Citations: 7