
2016 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

QCRI advanced transcription system (QATS) for the Arabic Multi-Dialect Broadcast media recognition: MGB-2 challenge
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846279
Sameer Khurana, Ahmed M. Ali
In this paper, we describe Qatar Computing Research Institute's (QCRI) speech transcription system for the 2016 Dialectal Arabic Multi-Genre Broadcast (MGB-2) challenge. MGB-2 is a controlled evaluation using 1,200 hours of audio with lightly supervised transcription. Our system, a combination of three purely sequence-trained recognition systems, achieved the lowest WER of 14.2% among the nine participating teams. Key features of our transcription system are: purely sequence-trained acoustic models using the recently introduced lattice-free maximum mutual information (LF-MMI) modeling framework; language model rescoring using four-gram and recurrent neural network with MaxEnt connections (RNNME) language models; and system combination using the minimum Bayes risk (MBR) decoding criterion. The whole system is built using the Kaldi speech recognition toolkit.
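To make the rescoring stage concrete, here is a minimal Python sketch of log-linear n-best rescoring with hypothetical per-hypothesis scores (`ac_logp`, `ngram_logp`, `rnnme_logp`) and weights; the actual QATS system rescores Kaldi lattices with its four-gram and RNNME language models and combines systems with MBR decoding, which this toy does not reproduce.

```python
# A minimal sketch of n-best rescoring by log-linear score interpolation.
def rescore_nbest(nbest, lm_weight=0.7, rnnme_weight=0.3, ac_scale=0.1):
    """Pick the best hypothesis by interpolating acoustic and LM scores.

    nbest: list of dicts with keys 'text', 'ac_logp' (acoustic log-prob),
           'ngram_logp' (4-gram LM log-prob), 'rnnme_logp' (RNNME log-prob).
    """
    def combined(h):
        lm_logp = lm_weight * h["ngram_logp"] + rnnme_weight * h["rnnme_logp"]
        return ac_scale * h["ac_logp"] + lm_logp

    return max(nbest, key=combined)

# Toy usage with made-up scores for two competing hypotheses.
nbest = [
    {"text": "hypothesis one", "ac_logp": -120.0, "ngram_logp": -35.0, "rnnme_logp": -30.0},
    {"text": "hypothesis two", "ac_logp": -118.0, "ngram_logp": -40.0, "rnnme_logp": -38.0},
]
print(rescore_nbest(nbest)["text"])
```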
Citations: 48
Learning utterance-level normalisation using Variational Autoencoders for robust automatic speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846243
Shawn Tan, K. Sim
This paper presents a Variational Autoencoder (VAE) based framework for modelling utterances. In this model, a mapping from an utterance to a distribution over the latent space, the VAE-utterance feature, is defined. This is in addition to a frame-level mapping, the VAE-frame feature. Using the Aurora-4 dataset, we train and perform some analysis on these models based on their detection of speaker and utterance variability, and also use combinations of LDA, i-vector, and VAE-frame and VAE-utterance features for speech recognition training. We find that it works equally well using VAE-frame + VAE-utterance features alone, and by using an LDA + VAE-frame + VAE-utterance feature combination, we obtain a word error rate (WER) of 9.59%, a gain over the 9.72% baseline which uses an LDA + i-vector combination.
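As an illustration of the frame-level and utterance-level mappings, below is a minimal PyTorch sketch (not the authors' model): an encoder produces a per-frame latent mean as a stand-in for the VAE-frame feature and a pooled utterance-level mean as a stand-in for the VAE-utterance feature. Layer sizes, the mean pooling, and the omitted decoder/KL terms are all assumptions.

```python
import torch
import torch.nn as nn

class UtteranceVAEEncoder(nn.Module):
    """Encoder half of a VAE over utterances (decoder and KL terms omitted)."""
    def __init__(self, feat_dim=40, hidden=256, latent=32):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.frame_mu = nn.Linear(hidden, latent)   # per-frame latent mean  -> "VAE-frame"
        self.utt_mu = nn.Linear(hidden, latent)     # utterance latent mean  -> "VAE-utterance"

    def forward(self, frames):                      # frames: (T, feat_dim)
        h = self.frame_net(frames)
        frame_feat = self.frame_mu(h)               # (T, latent)
        utt_feat = self.utt_mu(h.mean(dim=0))       # (latent,) one vector per utterance
        return frame_feat, utt_feat

enc = UtteranceVAEEncoder()
frame_feat, utt_feat = enc(torch.randn(120, 40))    # a 120-frame utterance
print(frame_feat.shape, utt_feat.shape)             # torch.Size([120, 32]) torch.Size([32])
```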
Citations: 19
A prioritized grid long short-term memory RNN for speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846305
Wei-Ning Hsu, Yu Zhang, James R. Glass
Recurrent neural networks (RNNs) are naturally suitable for speech recognition because of their ability to utilize dynamically changing temporal information. Deep RNNs have been argued to be able to model temporal relationships at different time granularities, but suffer from vanishing gradient problems. In this paper, we extend stacked long short-term memory (LSTM) RNNs by using grid LSTM blocks that formulate computation along not only the temporal dimension but also the depth dimension, in order to alleviate this issue. Moreover, we prioritize the depth dimension over the temporal one to provide the depth dimension with more updated information, since its output will be used for classification. We call this model the prioritized Grid LSTM (pGLSTM). Extensive experiments on four large datasets (AMI, HKUST, GALE, and MGB) indicate that the pGLSTM outperforms alternative deep LSTM models, beating stacked LSTMs with 4% to 7% relative improvement, and achieves new benchmarks among uni-directional models on all datasets.
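A conceptual PyTorch sketch of one such layer follows, under my own simplifying assumptions: one `LSTMCell` runs along time and another along depth, and the depth cell is fed the already-updated time-dimension hidden state so that the depth dimension receives the more recent information. The paper's exact pGLSTM gating differs in detail.

```python
import torch
import torch.nn as nn

class PrioritizedGridLSTMLayer(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.time_cell = nn.LSTMCell(2 * hidden_dim, hidden_dim)
        self.depth_cell = nn.LSTMCell(2 * hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, h_below, c_below):       # (T, B, H) hidden/memory from the layer below
        T, B, H = h_below.shape
        h_t = c_t = h_below.new_zeros(B, H)
        h_up, c_up = [], []
        for x_h, x_c in zip(h_below, c_below):  # step along the temporal dimension
            # 1) update the time dimension first ...
            h_t, c_t = self.time_cell(torch.cat([x_h, h_t], dim=-1), (h_t, c_t))
            # 2) ... then the depth dimension, conditioned on the updated time state,
            #    carrying the depth memory up from the layer below
            h_d, c_d = self.depth_cell(torch.cat([x_h, h_t], dim=-1), (x_h, x_c))
            h_up.append(h_d)
            c_up.append(c_d)
        return torch.stack(h_up), torch.stack(c_up)

layer = PrioritizedGridLSTMLayer(hidden_dim=64)
feats = torch.randn(50, 8, 64)                  # pretend frame features projected to H
h, c = layer(feats, torch.zeros_like(feats))
print(h.shape)                                  # torch.Size([50, 8, 64])
```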
Citations: 31
Low-rank bases for factorized hidden layer adaptation of DNN acoustic models
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846332
Lahiru Samarakoon, K. Sim
Recently, the factorized hidden layer (FHL) adaptation method was proposed for speaker adaptation of deep neural network (DNN) acoustic models. An FHL contains a speaker-dependent (SD) transformation matrix using a linear combination of rank-1 matrices and an SD bias using a linear combination of vectors, in addition to the standard affine transformation. On the other hand, full-rank bases are used with a similar DNN adaptation method based on cluster adaptive training (CAT). Therefore, it is interesting to investigate the effect of the rank of the bases used for adaptation. Increasing the rank of the bases improves the speaker subspace representation without increasing the number of learnable speaker parameters. In this work, we investigate the effect of using various ranks for the bases of the SD transformation of FHLs on the Aurora 4, AMI IHM and AMI SDM tasks. Experimental results have shown that when one FHL layer is used, it is optimal to use low-rank bases of rank 50 instead of full-rank bases. Furthermore, when multiple FHLs are used, rank-1 bases are sufficient.
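The speaker-dependent transform itself is easy to write down. The NumPy sketch below, with assumed dimensions and a ReLU activation, forms the SD weight matrix as the shared weight matrix plus a d-weighted combination of rank-r bases U_i V_i^T; rank 1 matches the original FHL formulation, while larger r enlarges the speaker subspace without adding per-speaker parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_bases, rank = 440, 1024, 10, 50

W = rng.standard_normal((d_out, d_in)) * 0.01          # shared affine weight
U = rng.standard_normal((n_bases, d_out, rank)) * 0.01 # left factors of the bases
V = rng.standard_normal((n_bases, d_in, rank)) * 0.01  # right factors of the bases
d = rng.standard_normal(n_bases)                       # speaker-dependent combination weights

# Speaker-dependent transform: W_s = W + sum_i d_i * U_i V_i^T
W_s = W + np.einsum("i,ior,inr->on", d, U, V)

x = rng.standard_normal(d_in)                          # one input frame
h = np.maximum(W_s @ x, 0.0)                           # hidden activation (ReLU assumed)
print(W_s.shape, h.shape)                              # (1024, 440) (1024,)
```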
Citations: 5
Contextual language model adaptation using dynamic classes
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846301
Lucy Vasserman, Ben Haynor, Petar S. Aleksic
Recent focus on assistant products has increased the need for extremely flexible speech systems that adapt well to specific users' needs. An important aspect of this is enabling users to make voice commands referencing their own personal data, such as favorite songs, application names, and contacts. Recognition accuracy for common commands such as playing music and sending text messages can be greatly improved if we know a user's preferences. In the past, we have addressed this problem using class-based language models that allow for query-time injection of class instances. However, this approach is limited by the need to train class-based models ahead of time.
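The class-based decomposition behind this can be illustrated with a toy Python sketch (hypothetical probabilities and a `$CONTACT` class of my own choosing): the class probability is part of the trained language model, while the user's own class members are injected at query time with, here, a uniform within-class distribution.

```python
import math

class_lm = {                       # P(token | "send a message to"), illustrative numbers only
    "$CONTACT": 0.4,
    "my": 0.1,
    "the": 0.05,
}

def score(word, history_prob=class_lm, user_contacts=None):
    """Log-probability of `word`, expanding the dynamic $CONTACT class at query time."""
    user_contacts = user_contacts or []
    if word in user_contacts:      # word enters via the dynamic class
        p = history_prob["$CONTACT"] * (1.0 / len(user_contacts))
    else:
        p = history_prob.get(word, 1e-6)
    return math.log(p)

contacts = ["alice", "bob", "charlie"]        # this user's personal data
print(score("alice", user_contacts=contacts)) # log P("alice" | history)
```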
Citations: 16
Modelling speaker and channel variability using deep neural networks for robust speaker verification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846264
Gautam Bhattacharya, Md. Jahangir Alam, P. Kenny, Vishwa Gupta
We propose to improve the performance of i-vector based speaker verification by processing the i-vectors with a deep neural network before they are fed to a cosine distance or probabilistic linear discriminant analysis (PLDA) classifier. To this end we build on an existing model that we refer to as Non-linear Within Class Normalization (NWCN) and introduce a novel Speaker Classifier Network (SCN). Both models deliver impressive speaker verification performance, showing a 56% and 68% relative improvement over standard i-vectors when combined with a cosine distance backend. The NWCN model also reduces the equal error rate for PLDA from 1.78% to 1.63%. We also test these models under the constraints of domain mismatch, i.e. when no in-domain training data is available. Under these conditions, SCN features in combination with cosine distance perform better than the PLDA baseline, achieving an equal error rate of 2.92% as compared to 3.37%.
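As a rough illustration of the pipeline, the PyTorch sketch below passes i-vectors through a small feed-forward network and scores a trial with cosine similarity on the resulting embeddings. The layer sizes are assumptions, and the paper's NWCN and SCN models have their own architectures and training objectives (e.g. speaker classification for SCN) that are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IVectorEmbedder(nn.Module):
    """Small network that maps i-vectors to length-normalised embeddings."""
    def __init__(self, ivec_dim=400, hidden=512, emb_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, ivec):
        return F.normalize(self.net(ivec), dim=-1)

model = IVectorEmbedder()
enrol, test = torch.randn(1, 400), torch.randn(1, 400)    # stand-in i-vectors
score = F.cosine_similarity(model(enrol), model(test))    # accept if above a threshold
print(score.item())
```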
Citations: 24
Semantically driven inversion transduction grammar induction for early stage training of spoken language translation
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846275
Meriem Beloucif, Dekai Wu
We propose an approach in which we inject a crosslingual semantic frame based objective function directly into inversion transduction grammar (ITG) induction in order to semantically train spoken language translation systems. This approach is a follow-up to our recent work on improving machine translation quality by tuning log-linear mixture weights using a semantic frame based objective function in the late, final stage of statistical machine translation training. In contrast, our new approach injects a semantic frame based objective function back into earlier stages of the training pipeline, during the actual learning of the translation model, biasing learning toward semantically more accurate alignments. Our work is motivated by the fact that ITG alignments have empirically been shown to fully cover crosslingual semantic frame alternations. We show that injecting a crosslingual semantic based objective function to drive ITG induction further sharpens the ITG constraints, leading to better performance than either the conventional ITG or the traditional GIZA++ based approaches.
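Schematically, the induction objective can be thought of as the ITG model's own alignment score plus a weighted crosslingual semantic-frame agreement term, as in the toy Python sketch below; the scoring functions and the weight `lam` are stand-ins, and the paper's actual objective and frame comparison are more involved.

```python
def combined_objective(alignment, itg_logprob, frame_agreement, lam=0.5):
    """Rank an alignment by ITG score plus weighted semantic-frame agreement.

    itg_logprob: log-score from ITG induction for this alignment.
    frame_agreement: fraction of semantic-frame roles that match across the
    two languages under this alignment (0.0 to 1.0).
    """
    return itg_logprob(alignment) + lam * frame_agreement(alignment)

# Toy usage with stand-in scoring functions for two candidate alignments.
best = max(
    ["align_A", "align_B"],
    key=lambda a: combined_objective(
        a,
        itg_logprob=lambda x: {"align_A": -12.0, "align_B": -11.5}[x],
        frame_agreement=lambda x: {"align_A": 0.9, "align_B": 0.2}[x],
    ),
)
print(best)   # the semantically more consistent alignment wins here
```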
Citations: 0
Iterative training of a DPGMM-HMM acoustic unit recognizer in a zero resource scenario
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846245
Michael Heck, S. Sakti, Satoshi Nakamura
In this paper we propose a framework for building a full-fledged acoustic unit recognizer in a zero resource setting, i.e., without any provided labels. For that, we combine an iterative Dirichlet process Gaussian mixture model (DPGMM) clustering framework with a standard pipeline for supervised GMM-HMM acoustic model (AM) and n-gram language model (LM) training, enhanced by a scheme for iterative model re-training. We use the DPGMM to cluster feature vectors into a dynamically sized set of acoustic units. The frame-based class labels serve as transcriptions of the audio data and are used as input to the AM and LM training pipeline. We show that iterative unsupervised re-training of this DPGMM-HMM acoustic unit recognizer improves performance according to an evaluation based on an ABX sound class discriminability task. Our results show that the learned models generalize well and that sound class discriminability benefits from the contextual information introduced by the language model. Our systems are competitive with phone recognizers trained with supervision, and can beat the baseline set by DPGMM clustering.
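The clustering stage can be approximated with scikit-learn's truncated Dirichlet-process mixture, as in the sketch below, where per-frame cluster labels become the "transcriptions" for the supervised AM/LM pipeline. The HMM training, decoding, and re-training steps are only indicated as a placeholder comment, and the paper uses its own DPGMM sampler rather than this variational approximation.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

frames = np.random.randn(5000, 39)                     # stand-in MFCC features

dpgmm = BayesianGaussianMixture(
    n_components=100,                                  # truncation level
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
)
labels = dpgmm.fit_predict(frames)                     # per-frame acoustic-unit labels
n_units = len(np.unique(labels))
print(f"{n_units} acoustic units in use out of 100 components")

# Placeholder for the supervised stage: treat `labels` as transcriptions,
# train a GMM-HMM recognizer plus an n-gram LM (e.g. with Kaldi), re-decode
# the audio, and feed the new labels back into the next training iteration.
```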
Citations: 13
Recurrent convolutional neural networks for structured speech act tagging
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846312
Takashi Ushio, Hongjie Shi, M. Endo, K. Yamagami, Noriaki Horii
Spoken language understanding (SLU) is one of the important problems in natural language processing, especially in dialog systems. The Fifth Dialog State Tracking Challenge (DSTC5) introduced an SLU challenge task: automatic tagging of speech utterances from two speaker roles with speech act tags and semantic slot tags. In this paper, we focus on speech act tagging. We propose a local coactivate multi-task learning model for capturing structured speech acts, based on sentence features from recurrent convolutional neural networks. Experimental results show that our model outperformed all other submitted entries and was able to capture coactivated local features of category and attribute, which are the parts of a speech act.
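The general shape of such a model is sketched below in PyTorch: a recurrent layer followed by a convolution over its hidden states, max-pooled into a sentence vector that feeds two heads, one for the speech-act category and one for the attribute. The dimensions, the pooling, and the local coactivation mechanism itself are my own simplifications rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class RCNNSpeechActTagger(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128, n_cat=10, n_attr=20):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.cat_head = nn.Linear(hidden, n_cat)     # speech-act category
        self.attr_head = nn.Linear(hidden, n_attr)   # speech-act attribute

    def forward(self, tokens):                       # tokens: (B, T) word ids
        h, _ = self.rnn(self.emb(tokens))            # (B, T, 2*hidden)
        c = torch.relu(self.conv(h.transpose(1, 2))) # local features over the sequence
        sent = c.max(dim=-1).values                  # max-pool into a sentence vector
        return self.cat_head(sent), self.attr_head(sent)

model = RCNNSpeechActTagger()
cat_logits, attr_logits = model(torch.randint(0, 5000, (4, 12)))
print(cat_logits.shape, attr_logits.shape)           # torch.Size([4, 10]) torch.Size([4, 20])
```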
Citations: 4
Abstractive headline generation for spoken content by attentive recurrent neural networks with ASR error modeling
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846258
Lang-Chi Yu, Hung-yi Lee, Lin-Shan Lee
Headline generation for spoken content is important since spoken content is difficult to display on screen and browse. It is a special type of abstractive summarization, in which the summary is generated word by word from scratch without using any part of the original content. Many deep learning approaches for headline generation from text documents have been proposed recently, all requiring huge quantities of training data, which is difficult to obtain for spoken document summarization. In this paper, we propose an ASR error modeling approach to learn the underlying structure of ASR error patterns and incorporate this model in an Attentive Recurrent Neural Network (ARNN) architecture. In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data for some recognizers. Experiments showed very encouraging results and verified that the proposed ASR error model works well even when the input spoken content is recognized by a recognizer very different from the one the model learned from.
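A toy version of the error simulation idea is sketched below: clean text is corrupted with random substitutions, deletions, and insertions so that a headline generator trained on abundant text also sees ASR-like inputs. The paper learns the error patterns of a real recognizer, whereas this sketch uses uniform random noise and a made-up vocabulary.

```python
import random

def simulate_asr_errors(words, vocab, p_sub=0.08, p_del=0.04, p_ins=0.04, seed=0):
    """Corrupt a word sequence with random ASR-like errors."""
    rng = random.Random(seed)
    noisy = []
    for w in words:
        r = rng.random()
        if r < p_del:
            continue                          # deletion: drop the word
        elif r < p_del + p_sub:
            noisy.append(rng.choice(vocab))   # substitution: confuse with another word
        else:
            noisy.append(w)
        if rng.random() < p_ins:
            noisy.append(rng.choice(vocab))   # insertion: spurious extra word
    return noisy

vocab = ["stock", "market", "report", "weather", "today", "rises"]
clean = "stock market rises in early trading today".split()
print(" ".join(simulate_asr_errors(clean, vocab)))
```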
Citations: 6