
Latest publications from the 2016 IEEE Spoken Language Technology Workshop (SLT)

Tracking dialog states using an Author-Topic based representation
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846316
Richard Dufour, Mohamed Morchid, Titouan Parcollet
Automatically translating textual documents from one language to another inevitably introduces translation errors. Beyond language-specific difficulties, automatic translation is even harder for spoken dialogues since, for example, the language register is far from “clean speech”. Speech analytics suffer from these translation errors. To tackle this difficulty, one solution is to map translations into a space of hidden topics. In the classical topic-based representation obtained from Latent Dirichlet Allocation (LDA), the distribution of words over each topic is estimated automatically, but the target classes of a classification task are ignored. In the DSTC5 main task, this class information is crucial, the main objective being to track dialog states for sub-dialogue segments. For this challenge, we propose an original topic-based representation of each sub-dialogue built not only from the sub-dialogue content itself (words) but also from the dialog state associated with the sub-dialogue. This representation is based on the Author-Topic (AT) model, previously applied successfully to a different classification task. Promising results confirm the interest of the method: in terms of F-measure, the AT model performs slightly better than the baselines provided by the task organizers.
Citations: 3
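The abstract gives no implementation details, so the following is only a minimal sketch of the core idea, assuming gensim's AuthorTopicModel as a stand-in: each dialog-state label plays the role of the "author" of the sub-dialogues annotated with it, so the learned topics are tied to the target classes. The data, state names, and topic count are toy values, not the paper's setup.

```python
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

# Toy sub-dialogues (token lists) and their dialog-state labels.
sub_dialogues = [
    ["book", "hotel", "near", "station"],
    ["cheap", "hotel", "downtown"],
    ["train", "ticket", "tomorrow", "morning"],
]
states = ["ACCOMMODATION", "ACCOMMODATION", "TRANSPORT"]

dictionary = Dictionary(sub_dialogues)
corpus = [dictionary.doc2bow(doc) for doc in sub_dialogues]

# Map each "author" (here: a dialog state) to the documents it labels.
state2doc = {}
for doc_id, state in enumerate(states):
    state2doc.setdefault(state, []).append(doc_id)

model = AuthorTopicModel(corpus=corpus, num_topics=2,
                         id2word=dictionary, author2doc=state2doc)

# Topic distribution of a state: a state-aware representation of its sub-dialogues.
print(model.get_author_topics("ACCOMMODATION"))
```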
Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846300
Ondrej Klejch, P. Bell, S. Renals
In this paper we investigate the punctuated transcription of multi-genre broadcast media. We examine four systems: three based on lexical features, and a fourth that uses acoustic features by integrating punctuation into the speech recognition acoustic models. We also explore combining these component systems using voting and log-linear interpolation. We performed experiments on the English-language MGB Challenge data, which comprises about 1,600 hours of BBC television recordings. Our results indicate that a lexical system based on a neural machine translation approach is significantly better than the other systems, achieving an F-measure of 62.6% on reference text, with a relative degradation of 19% on ASR output. Analysing the results by punctuation mark shows that longer context improves the prediction of question marks, while acoustic information improves the prediction of exclamation marks. Finally, we show that even though the systems are complementary, their straightforward combination does not yield better F-measures than the single neural machine translation system.
Citations: 33
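Of the two combination schemes mentioned above, per-token majority voting is straightforward to sketch. The snippet below is a hypothetical illustration, not the authors' implementation; the punctuation label set and the first-system tie-break are assumptions.

```python
from collections import Counter

def vote_punctuation(predictions):
    """Combine per-token punctuation predictions from several systems
    by majority vote; ties fall back to the first system's output."""
    combined = []
    for token_preds in zip(*predictions):  # one tuple of labels per token
        label, count = Counter(token_preds).most_common(1)[0]
        combined.append(label if count > 1 else token_preds[0])
    return combined

# Three hypothetical systems labelling the same 5-token utterance
# with labels from {O: none, COMMA, PERIOD, QMARK}.
sys_a = ["O", "COMMA", "O", "O", "PERIOD"]
sys_b = ["O", "O",     "O", "O", "PERIOD"]
sys_c = ["O", "COMMA", "O", "O", "QMARK"]
print(vote_punctuation([sys_a, sys_b, sys_c]))
# ['O', 'COMMA', 'O', 'O', 'PERIOD']
```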
A multichannel convolutional neural network for cross-language dialog state tracking
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846318
Hongjie Shi, Takashi Ushio, M. Endo, K. Yamagami, Noriaki Horii
The fifth Dialog State Tracking Challenge (DSTC5) introduces a new cross-language dialog state tracking scenario in which participants build their trackers on an English training corpus but are evaluated on an unlabeled Chinese corpus. Although computer-generated translations of both the English and the Chinese corpora are provided in the dataset, these translations contain errors, and careless use of them can easily hurt tracker performance. To address this problem, we propose a multichannel Convolutional Neural Network (CNN) architecture in which English and Chinese are treated as different input channels of a single CNN model. In the DSTC5 evaluation, we found that this multichannel architecture effectively improves robustness against translation errors. Additionally, our method is purely machine-learning based and requires no prior knowledge of the target language. We consider this a desirable property for building a tracker in a cross-language context, since not every developer will be familiar with both languages.
Citations: 34
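As a rough illustration of the channel idea, assuming a standard multichannel text-CNN layout, the sketch below embeds the English and Chinese versions of a segment separately and stacks them as two input channels of shared convolutions. Vocabulary sizes, filter counts, and the requirement that both token sequences be padded to the same length are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultichannelTextCNN(nn.Module):
    """Two-channel text CNN: the English and Chinese versions of a
    segment are embedded separately and stacked as two input channels
    of the same 2-D convolutions, then max-pooled over time."""
    def __init__(self, vocab_en, vocab_zh, emb_dim=128, n_filters=100,
                 kernel_sizes=(3, 4, 5), n_classes=30):
        super().__init__()
        self.emb_en = nn.Embedding(vocab_en, emb_dim)
        self.emb_zh = nn.Embedding(vocab_zh, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv2d(2, n_filters, (k, emb_dim)) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens_en, tokens_zh):
        # (batch, seq, emb) per language -> channels: (batch, 2, seq, emb)
        x = torch.stack([self.emb_en(tokens_en), self.emb_zh(tokens_zh)], dim=1)
        pooled = [torch.relu(conv(x)).squeeze(3).max(dim=2).values
                  for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = MultichannelTextCNN(vocab_en=5000, vocab_zh=8000)
logits = model(torch.randint(0, 5000, (4, 20)), torch.randint(0, 8000, (4, 20)))
print(logits.shape)  # torch.Size([4, 30])
```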
Comparing speaker independent and speaker adapted classification for word prominence detection
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846271
Andrea Schnall, M. Heckmann
Prosodic cues are an important part of human communication. One of these cues is word prominence, which is used, for example, to highlight important information. Since individual speakers express prominence in different ways, it is not easily extracted and incorporated into a dialog system; as a consequence, prominence has so far played only a marginal role in human-machine communication. In this paper we compare speaker-independently trained DNNs and SVMs with SVM classification using a speaker adaptation method we recently developed. This adaptation method is based on the radial basis function kernel of the SVM with a Gaussian regularization derived from fMLLR. With this adaptation, we can notably reduce the problem of speaker variation. We present detailed evaluations of the methods and discuss the advantages and shortcomings of the proposed approaches for word prominence detection.
Citations: 3
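The fMLLR-derived Gaussian regularization is the authors' own contribution and is not available in standard toolkits, so the sketch below covers only the speaker-independent side of the comparison: an RBF-kernel SVM over pooled prosodic features. The feature dimensions and labels are synthetic, for illustration only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy prosodic feature vectors (e.g. duration, energy, F0 statistics per
# word) and binary prominence labels; real features would come from a corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic "prominent" labels

# Speaker-independent classifier: RBF-kernel SVM over pooled speakers.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))
```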
Discriminative acoustic word embeddings: Recurrent neural network-based approaches
Pub Date : 2016-11-08 DOI: 10.1109/SLT.2016.7846310
Shane Settle, Karen Livescu
Acoustic word embeddings — fixed-dimensional vector representations of variable-length spoken word segments — have begun to be considered for tasks such as speech recognition and query-by-example search. Such embeddings can be learned discriminatively so that they are similar for speech segments corresponding to the same word, while being dissimilar for segments corresponding to different words. Recent work has found that acoustic word embeddings can outperform dynamic time warping on query-by-example search and related word discrimination tasks. However, the space of embedding models and training approaches is still relatively unexplored. In this paper we present new discriminative embedding models based on recurrent neural networks (RNNs). We consider training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting. We find that both classifier-based and Siamese RNN embeddings improve over previously reported results on a word discrimination task, with Siamese RNNs outperforming classification models. In addition, we present analyses of the learned embeddings and the effects of variables such as dimensionality and network structure.
Citations: 80
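A minimal sketch of the Siamese setup described above, assuming a single-layer GRU encoder and a cosine-distance contrastive loss over length-normalized embeddings; the feature dimensionality (39, e.g. MFCCs with deltas), embedding size, and margin are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticWordEncoder(nn.Module):
    """RNN acoustic word embedder: a GRU over an acoustic feature
    sequence whose final hidden state, projected and normalized, is the
    fixed-dimensional embedding of the variable-length word segment."""
    def __init__(self, feat_dim=39, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        _, h = self.rnn(x)           # h: (num_layers, batch, hidden)
        return F.normalize(self.proj(h[-1]), dim=1)

def contrastive_loss(e1, e2, same, margin=0.5):
    """Pull same-word pairs together, push different-word pairs apart
    (cosine distance, valid because embeddings are unit-normalized)."""
    dist = 1 - (e1 * e2).sum(dim=1)
    return torch.where(same, dist, F.relu(margin - dist)).mean()

# "Siamese" means the same encoder (shared weights) embeds both segments.
enc = AcousticWordEncoder()
a, b = torch.randn(8, 50, 39), torch.randn(8, 50, 39)  # toy feature segments
same = torch.randint(0, 2, (8,)).bool()
loss = contrastive_loss(enc(a), enc(b), same)
loss.backward()
```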
End-to-end training approaches for discriminative segmental models
Pub Date : 2016-10-21 DOI: 10.1109/SLT.2016.7846309
Hao Tang, Weiran Wang, Kevin Gimpel, Karen Livescu
Recent work on discriminative segmental models has shown that they can achieve competitive speech recognition performance, using features based on deep neural frame classifiers. However, segmental models can be more challenging to train than standard frame-based approaches. While some segmental models have been successfully trained end to end, there is a lack of understanding of their training under different settings and with different losses.
Citations: 7
Jointly learning to align and convert graphemes to phonemes with neural attention models
Pub Date : 2016-10-20 DOI: 10.1109/SLT.2016.7846248
Shubham Toshniwal, Karen Livescu
We propose an attention-enabled encoder-decoder model for the problem of grapheme-to-phoneme conversion. Most previous work has tackled the problem via joint sequence models that require explicit alignments for training. In contrast, the attention-enabled encoder-decoder model allows for jointly learning to align and convert characters to phonemes. We explore different types of attention models, including global and local attention, and our best models achieve state-of-the-art results on three standard data sets (CMU-Dict, Pronlex, and NetTalk).
Citations: 38
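Of the attention variants the paper explores, global (dot-product) attention is the simplest to sketch. The function below is a generic illustration of how a decoder state attends over grapheme encoder states to produce a soft grapheme-to-phoneme alignment, not the paper's exact model; the shapes and sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def global_attention(decoder_state, encoder_states):
    """Dot-product global attention: one weight per input grapheme, so
    the model learns a soft alignment between graphemes and the phoneme
    currently being produced.
    decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)                      # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights

# Toy check: 2 words of 7 graphemes each, hidden size 32.
ctx, w = global_attention(torch.randn(2, 32), torch.randn(2, 7, 32))
print(ctx.shape, w.shape)  # torch.Size([2, 32]) torch.Size([2, 7])
```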
Very deep convolutional neural networks for robust speech recognition
Pub Date : 2016-10-02 DOI: 10.1109/SLT.2016.7846307
Y. Qian, P. Woodland
This paper describes the extension and optimisation of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the filter sizes, pooling operations and input feature maps are all modified: the filter and pooling sizes are reduced, and the dimensions of the input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework is developed that jointly trains the very deep CNN with auxiliary i-vector and fMLLR features. These modifications give substantial word error rate reductions over the standard CNN used as the baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, improving to 7.99% with auxiliary-feature joint training and to 7.09% with LSTM-RNN joint decoding.
Citations: 79
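The state-level weighted log-likelihood combination amounts to a log-linear interpolation of the two acoustic models' per-state scores during decoding. A minimal sketch, with the interpolation weight and score shapes assumed for illustration:

```python
import numpy as np

def combine_state_scores(loglik_cnn, loglik_lstm, weight=0.5):
    """Frame-level sketch of state-level weighted log-likelihood
    combination: log-linear interpolation of the per-state scores of
    two acoustic models, used jointly during decoding."""
    return weight * loglik_cnn + (1.0 - weight) * loglik_lstm

# Toy scores: 10 frames x 2000 tied HMM states from each model.
cnn = np.log(np.random.dirichlet(np.ones(2000), size=10))
lstm = np.log(np.random.dirichlet(np.ones(2000), size=10))
combined = combine_state_scores(cnn, lstm, weight=0.5)
print(combined.shape)  # (10, 2000)
```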
Optimizing neural network hyperparameters with Gaussian processes for dialog act classification
Pub Date : 2016-09-27 DOI: 10.1109/SLT.2016.7846296
Franck Dernoncourt, Ji Young Lee
Systems based on artificial neural networks (ANNs) have achieved state-of-the-art results in many natural language processing tasks. Although ANNs do not require manually engineered features, they have many hyperparameters to be optimized, and the choice of hyperparameters significantly impacts model performance. However, ANN hyperparameters are typically chosen by manual, grid, or random search, which either requires expert experience or is computationally expensive. Recent approaches based on Bayesian optimization with Gaussian processes (GPs) offer a more systematic way to automatically pinpoint optimal or near-optimal machine learning hyperparameters. Using a previously published ANN model that yields state-of-the-art results for dialog act classification, we demonstrate that optimizing hyperparameters with a GP further improves the results and reduces the computational time by a factor of 4 compared to a random search. It is therefore a useful technique for tuning ANN models to yield the best performance on NLP tasks.
Citations: 26
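A minimal sketch of GP-based hyperparameter search, assuming scikit-optimize's gp_minimize as the optimizer; the search space and the train_and_eval_ann objective are hypothetical stand-ins for the paper's ANN and its tuning ranges.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical hyperparameter space for the dialog-act classifier.
space = [Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
         Integer(50, 500, name="hidden_units"),
         Real(0.0, 0.9, name="dropout")]

def objective(params):
    lr, hidden, dropout = params
    # train_and_eval_ann() is hypothetical: it would train the ANN with
    # these hyperparameters and return the validation error (1 - accuracy).
    return train_and_eval_ann(lr=lr, hidden=hidden, dropout=dropout)

# The GP surrogate models the objective and picks the next setting to try.
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best error:", result.fun, "best hyperparameters:", result.x)
```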
A robust diarization system for measuring dominance in Peer-Led Team Learning groups
Pub Date : 2016-09-26 DOI: 10.1109/SLT.2016.7846283
Harishchandra Dubey, A. Sangwan, J. Hansen
Peer-Led Team Learning (PLTL) is a structured learning model in which a team leader is appointed to facilitate collaborative problem solving among students in Science, Technology, Engineering and Mathematics (STEM) courses. This paper presents an informed HMM-based speaker diarization system: the minimum duration of short conversational turns and the number of participating students are fed to the HMM system as side information. A modified form of the Bayesian Information Criterion (BIC) is used for iterative merging and re-segmentation. Finally, we use the diarization output to compute a novel dominance score based on unsupervised acoustic analysis.
Citations: 10
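The paper's dominance score builds on further unsupervised acoustic analysis; the sketch below shows only the simplest ingredient one can derive from diarization output, each speaker's share of total speaking time, with made-up segment boundaries.

```python
from collections import defaultdict

def dominance_scores(segments):
    """Given diarization output as (speaker, start_sec, end_sec) tuples,
    return each speaker's share of total speaking time: a simple
    speaking-time-based dominance measure."""
    talk = defaultdict(float)
    for speaker, start, end in segments:
        talk[speaker] += end - start
    total = sum(talk.values())
    return {spk: t / total for spk, t in talk.items()}

# Hypothetical diarization output for a short PLTL exchange.
segments = [("leader", 0.0, 12.5), ("s1", 12.5, 15.0),
            ("leader", 15.0, 20.0), ("s2", 20.0, 31.0)]
print(dominance_scores(segments))
# {'leader': 0.5645..., 's1': 0.0806..., 's2': 0.3548...}
```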