判别声学词嵌入:当前基于神经网络的方法

2016 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2016-11-08 DOI:10.1109/SLT.2016.7846310

Shane Settle, Karen Livescu

{"title":"判别声学词嵌入:当前基于神经网络的方法","authors":"Shane Settle, Karen Livescu","doi":"10.1109/SLT.2016.7846310","DOIUrl":null,"url":null,"abstract":"Acoustic word embeddings — fixed-dimensional vector representations of variable-length spoken word segments — have begun to be considered for tasks such as speech recognition and query-by-example search. Such embeddings can be learned discriminatively so that they are similar for speech segments corresponding to the same word, while being dissimilar for segments corresponding to different words. Recent work has found that acoustic word embeddings can outperform dynamic time warping on query-by-example search and related word discrimination tasks. However, the space of embedding models and training approaches is still relatively unexplored. In this paper we present new discriminative embedding models based on recurrent neural networks (RNNs). We consider training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting. We find that both classifier-based and Siamese RNN embeddings improve over previously reported results on a word discrimination task, with Siamese RNNs outperforming classification models. In addition, we present analyses of the learned embeddings and the effects of variables such as dimensionality and network structure.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"80","resultStr":"{\"title\":\"Discriminative acoustic word embeddings: Tecurrent neural network-based approaches\",\"authors\":\"Shane Settle, Karen Livescu\",\"doi\":\"10.1109/SLT.2016.7846310\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Acoustic word embeddings — fixed-dimensional vector representations of variable-length spoken word segments — have begun to be considered for tasks such as speech recognition and query-by-example search. Such embeddings can be learned discriminatively so that they are similar for speech segments corresponding to the same word, while being dissimilar for segments corresponding to different words. Recent work has found that acoustic word embeddings can outperform dynamic time warping on query-by-example search and related word discrimination tasks. However, the space of embedding models and training approaches is still relatively unexplored. In this paper we present new discriminative embedding models based on recurrent neural networks (RNNs). We consider training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting. We find that both classifier-based and Siamese RNN embeddings improve over previously reported results on a word discrimination task, with Siamese RNNs outperforming classification models. In addition, we present analyses of the learned embeddings and the effects of variables such as dimensionality and network structure.\",\"PeriodicalId\":281635,\"journal\":{\"name\":\"2016 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"142 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-11-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"80\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2016.7846310\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2016.7846310","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 80

摘要

声学词嵌入——可变长度口语词段的固定维向量表示——已经开始被考虑用于语音识别和按例查询搜索等任务。这种嵌入可以被区分地学习，使得它们对于同一词对应的语音片段是相似的，而对于不同词对应的语音片段是不同的。最近的研究发现，声学词嵌入在按例查询搜索和相关词识别任务上的表现优于动态时间扭曲。然而，嵌入模型和训练方法的空间仍然相对未被探索。本文提出了一种基于递归神经网络(RNNs)的判别嵌入模型。我们考虑了在之前的工作中已经成功的训练损失，特别是单词分类的交叉熵损失和明确旨在在“暹罗网络”训练设置中分离相同单词和不同单词对的对比损失。我们发现基于分类器和Siamese RNN嵌入在单词识别任务上都比之前报道的结果有所改善，其中Siamese RNN优于分类模型。此外，我们还分析了学习嵌入以及维度和网络结构等变量的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Discriminative acoustic word embeddings: Tecurrent neural network-based approaches

Acoustic word embeddings — fixed-dimensional vector representations of variable-length spoken word segments — have begun to be considered for tasks such as speech recognition and query-by-example search. Such embeddings can be learned discriminatively so that they are similar for speech segments corresponding to the same word, while being dissimilar for segments corresponding to different words. Recent work has found that acoustic word embeddings can outperform dynamic time warping on query-by-example search and related word discrimination tasks. However, the space of embedding models and training approaches is still relatively unexplored. In this paper we present new discriminative embedding models based on recurrent neural networks (RNNs). We consider training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a “Siamese network” training setting. We find that both classifier-based and Siamese RNN embeddings improve over previously reported results on a word discrimination task, with Siamese RNNs outperforming classification models. In addition, we present analyses of the learned embeddings and the effects of variables such as dimensionality and network structure.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE Spoken Language Technology Workshop (SLT)

自引率

0.00%

发文量