
2016 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Towards a virtual personal assistant based on a user-defined portfolio of multi-domain vocal applications
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846252
Tatiana Ekeinhor-Komi, J. Bouraoui, R. Laroche, F. Lefèvre
This paper proposes a novel approach to defining and simulating a new generation of virtual personal assistants as multi-application, multi-domain distributed dialogue systems. The first contribution is the assistant architecture, composed of independent third-party applications handled by a Dispatcher. In this view, applications are black boxes that respond to user requests with self-scored answers. Next, the Dispatcher distributes the current request to the most relevant application, based on these scores and the context (history of interaction, etc.), and conveys its answer to the user. To address variations in the user-defined portfolio of applications, the second contribution, a stochastic model, automates the online optimisation of the Dispatcher's behaviour. To evaluate the learnability of the Dispatcher's policy, several parametrisations of the user and application simulators are enabled so that they cover a range of realistic situations. Results confirm that, in all considered configurations of interest, reinforcement learning can learn adapted strategies.
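
The dispatching idea can be illustrated with a small Python sketch (not the paper's implementation): each black-box application returns an answer with a self-score, and the Dispatcher routes the request to the application whose weighted score is highest, the per-application weights standing in for the stochastic policy that the paper optimises online with reinforcement learning. The class names, keyword lists, and epsilon-greedy update below are hypothetical.

```python
import random

class EchoApp:
    """Hypothetical black-box application: returns (answer, self-score) for a request."""
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords

    def respond(self, request):
        # Self-scored confidence: fraction of request words this app recognises.
        words = request.lower().split()
        score = sum(w in self.keywords for w in words) / max(len(words), 1)
        return f"[{self.name}] handling: {request}", score

class Dispatcher:
    """Routes each request to one application based on self-scores and learned per-app weights."""
    def __init__(self, apps, epsilon=0.1):
        self.apps = apps
        self.weights = {app.name: 1.0 for app in apps}  # tuned online in the paper (RL policy)
        self.epsilon = epsilon

    def dispatch(self, request):
        answers = {app.name: app.respond(request) for app in self.apps}
        if random.random() < self.epsilon:           # exploration
            chosen = random.choice(list(answers))
        else:                                         # exploitation: weighted self-score
            chosen = max(answers, key=lambda n: self.weights[n] * answers[n][1])
        return chosen, answers[chosen][0]

    def update(self, app_name, reward, lr=0.1):
        # Crude stand-in for the online policy optimisation described in the paper.
        self.weights[app_name] += lr * reward

apps = [EchoApp("weather", {"rain", "sunny", "weather"}),
        EchoApp("music", {"play", "song", "music"})]
dispatcher = Dispatcher(apps)
print(dispatcher.dispatch("play a song"))
```
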
Citations: 2
Code-switching detection using multilingual DNNs
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846326
Emre Yilmaz, H. V. D. Heuvel, D. V. Leeuwen
Automatic speech recognition (ASR) of code-switching speech requires careful handling of unexpected language switches that may occur in a single utterance. In this paper, we investigate the feasibility of using multilingually trained deep neural networks (DNN) for the ASR of Frisian speech containing code-switches to Dutch with the aim of building a robust recognizer that can handle this phenomenon. For this purpose, we train several multilingual DNN models on Frisian and two closely related languages, namely English and Dutch, to compare the impact of single-step and two-step multilingual DNN training on the recognition and code-switching detection performance. We apply bilingual DNN retraining on both target languages by varying the amount of training data belonging to the higher-resourced target language (Dutch). The recognition results show that the multilingual DNN training scheme with an initial multilingual training step followed by bilingual retraining provides recognition performance comparable to an oracle baseline recognizer that can employ language-specific acoustic models. We further show that we can detect code-switches at the word level with an equal error rate of around 17% excluding the deletions due to ASR errors.
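
As a rough illustration of detecting code-switches at the word level, the sketch below flags word positions where the dominant language posterior changes between adjacent words; the word list, posterior values, and threshold are invented for the example, and this is a simplified stand-in for the paper's detection method.

```python
def detect_code_switches(words, lang_posteriors, threshold=0.5):
    """Flag word positions where the dominant language changes.

    words           -- recognised word sequence from the ASR output
    lang_posteriors -- one dict of language posteriors per word, e.g. {"frisian": 0.8, "dutch": 0.2}
    threshold       -- minimum posterior required for a confident language assignment
    """
    switches = []
    prev_lang = None
    for i, post in enumerate(lang_posteriors):
        lang, p = max(post.items(), key=lambda kv: kv[1])
        if p < threshold:
            continue  # skip uncertain words rather than forcing a decision
        if prev_lang is not None and lang != prev_lang:
            switches.append((i, words[i], prev_lang, lang))
        prev_lang = lang
    return switches

words = ["moarn", "gean", "ik", "naar", "huis"]
posteriors = [{"frisian": 0.9, "dutch": 0.1}, {"frisian": 0.8, "dutch": 0.2},
              {"frisian": 0.6, "dutch": 0.4}, {"frisian": 0.2, "dutch": 0.8},
              {"frisian": 0.1, "dutch": 0.9}]
print(detect_code_switches(words, posteriors))  # -> [(3, 'naar', 'frisian', 'dutch')]
```
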
Citations: 31
Dialog state tracking with attention-based sequence-to-sequence learning
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846317
Takaaki Hori, Hai Wang, Chiori Hori, Shinji Watanabe, B. Harsham, Jonathan Le Roux, J. Hershey, Yusuke Koji, Yi Jing, Zhaocheng Zhu, T. Aikawa
We present an advanced dialog state tracking system designed for the 5th Dialog State Tracking Challenge (DSTC5). The main task of DSTC5 is to track the dialog state in a human-human dialog. For each utterance, the tracker emits a frame of slot-value pairs considering the full history of the dialog up to the current turn. Our system includes an encoder-decoder architecture with an attention mechanism to map an input word sequence to a set of semantic labels, i.e., slot-value pairs. This handles the problem of the unknown alignment between the utterances and the labels. By combining the attention-based tracker with rule-based trackers elaborated for English and Chinese, the F-score on the development set improved from 0.475 to 0.507 compared to the rule-only trackers. Moreover, we achieved an F-score of 0.517 by refining the combination strategy based on the topic- and slot-level performance of each tracker. In this paper, we also validate the efficacy of each technique and report the test set results submitted to the challenge.
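
The attention step at the heart of such an encoder-decoder tracker can be sketched as generic dot-product attention over encoder states (a minimal illustration, not the authors' exact network; the dimensions and vectors are random placeholders).

```python
import numpy as np

def attention_step(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state, normalise with
    softmax, and return the weighted context vector used to predict the next label."""
    scores = encoder_states @ decoder_state            # (T,) dot-product relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over input positions
    context = weights @ encoder_states                 # (D,) context vector
    return context, weights

T, D = 6, 4                                            # 6 encoded words, 4-dim hidden states
rng = np.random.default_rng(0)
enc = rng.normal(size=(T, D))                          # encoder outputs for one utterance
dec = rng.normal(size=D)                               # current decoder hidden state
ctx, w = attention_step(dec, enc)
print("attention weights:", np.round(w, 3))
```
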
Citations: 27
Towards acoustic model unification across dialects
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846328
Mohamed G. Elfeky, M. Bastani, Xavier Velez, P. Moreno, Austin Waters
Acoustic model performance typically decreases when evaluated on a dialectal variation of the same language that was not used during training. Similarly, models simultaneously trained on a group of dialects tend to underperform dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and Multitask Learning (MTL). In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge into a single model. In MTL, we utilize multitask learning to train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the jointly-trained model that is trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively. While achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3.4%.
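
The distillation step can be pictured with a minimal sketch (hypothetical shapes and synthetic data, not the paper's training pipeline): the dialect-specific teachers' frame-level posteriors are averaged into soft targets, and the unified student is trained to minimise the cross-entropy against them.

```python
import numpy as np

def ensemble_soft_targets(posterior_list):
    """Average frame-level posteriors from the dialect-specific 'teacher' models."""
    return np.mean(posterior_list, axis=0)

def distillation_loss(student_logits, soft_targets):
    """Cross-entropy between the ensemble soft targets and the student's softmax output."""
    logits = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.sum(soft_targets * log_probs, axis=-1))

frames, senones = 3, 5
rng = np.random.default_rng(1)
teacher_posts = [rng.dirichlet(np.ones(senones), size=frames) for _ in range(4)]  # 4 dialect models
targets = ensemble_soft_targets(teacher_posts)
student_logits = rng.normal(size=(frames, senones))
print("distillation loss:", distillation_loss(student_logits, targets))
```
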
Citations: 29
Blind speech segmentation using spectrogram image-based features and Mel cepstral coefficients
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846324
Adriana Stan, Cassia Valentini-Botinhao, B. Orza, M. Giurgiu
This paper introduces a novel method for blind speech segmentation at the phone level based on image processing. We consider the spectrogram of the waveform of an utterance as an image and hypothesize that its striping defects, i.e. discontinuities, appear due to phone boundaries. Using a simple image destriping algorithm these discontinuities are found. To discover phone transitions which are not as salient in the image, we compute spectral changes derived from the time evolution of the Mel cepstral parametrisation of speech. These so-called image-based and acoustic features are then combined to form a mixed probability function, whose values indicate the likelihood of a phone boundary being located at the corresponding time frame. The method is completely unsupervised and achieves an accuracy of 75.59% at a −3.26% over-segmentation rate, yielding an F-measure of 0.76 and an R-value of 0.80 on the TIMIT dataset.
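
The final combination step can be sketched roughly as follows (synthetic per-frame scores and an assumed mixing weight, not the paper's exact formulation): the image-based discontinuity score and the Mel-cepstral change score are normalised, mixed into one boundary probability, and local maxima above a threshold are kept as candidate phone boundaries.

```python
import numpy as np

def mixed_boundary_probability(image_score, spectral_change, alpha=0.5):
    """Combine normalised image-based and acoustic scores per frame (alpha is a mixing weight)."""
    norm = lambda x: (x - x.min()) / (np.ptp(x) + 1e-9)
    return alpha * norm(image_score) + (1 - alpha) * norm(spectral_change)

def pick_boundaries(prob, threshold=0.6):
    """Keep frames that are local maxima of the mixed probability above a threshold."""
    return [t for t in range(1, len(prob) - 1)
            if prob[t] > threshold and prob[t] >= prob[t - 1] and prob[t] >= prob[t + 1]]

rng = np.random.default_rng(2)
img = rng.random(100)      # stand-in for per-frame spectrogram discontinuity scores
mfcc = rng.random(100)     # stand-in for per-frame Mel-cepstral change
prob = mixed_boundary_probability(img, mfcc)
print("candidate phone boundaries (frames):", pick_boundaries(prob))
```
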
Citations: 11
The NDSC transcription system for the 2016 multi-genre broadcast challenge
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846276
Xukui Yang, Dan Qu, Wenlin Zhang, Weiqiang Zhang
The National Digital Switching System Engineering and Technological R&D Center (NDSC) speech-to-text transcription system for the 2016 multi-genre broadcast challenge is described. Various acoustic models based on deep neural networks (DNNs), such as hybrid DNNs, long short-term memory recurrent neural networks (LSTM RNNs), and time-delay neural networks (TDNNs), are trained. The system also makes use of recurrent neural network language models (RNNLMs) for re-scoring and minimum Bayes risk (MBR) combination. The WER on the test dataset of the speech-to-text task is 18.2%. Furthermore, to simulate real applications where manual segmentations are not available, an automatic segmentation system based on long-term information is proposed. WERs based on the automatically generated segments were slightly worse than those based on the manual segmentations.
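
RNNLM rescoring of an n-best list can be illustrated with a minimal sketch; the hypothesis texts, scores, and interpolation weight below are invented, and the paper's system additionally applies MBR combination on top of such rescored lists.

```python
def rescore_nbest(nbest, lm_weight=0.7):
    """Re-rank an n-best list by interpolating acoustic and RNNLM log-scores.

    nbest     -- list of dicts with 'text', 'acoustic' (log-likelihood) and 'rnnlm' (log-prob)
    lm_weight -- relative weight of the RNNLM score (a tunable hyper-parameter)
    """
    def total(h):
        return h["acoustic"] + lm_weight * h["rnnlm"]
    return sorted(nbest, key=total, reverse=True)

nbest = [
    {"text": "the cat sat", "acoustic": -120.0, "rnnlm": -15.0},
    {"text": "the cats at", "acoustic": -119.5, "rnnlm": -22.0},
]
best = rescore_nbest(nbest)[0]
print("best hypothesis after rescoring:", best["text"])
```
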
Citations: 6
Pre-filtered dynamic time warping for posteriorgram based keyword search
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846292
Gozde Cetinkaya, Batuhan Gündogdu, M. Saraçlar
In this study, we present a pre-filtering method for dynamic time warping (DTW) to improve the efficiency of a posteriorgram based keyword search (KWS) system. The ultimate aim is to improve the performance of a large vocabulary continuous speech recognition (LVCSR) based KWS system using the posteriorgram based KWS approach. We use phonetic posteriorgrams to represent the audio data and generate average posteriorgrams to represent the given text queries. The DTW algorithm is used to determine the optimal alignment between the posteriorgrams of the audio data and the queries. Since DTW has quadratic complexity, it can be relatively inefficient for keyword search. Our main contribution is to reduce this complexity by pre-filtering based on a vector space representation of the two posteriorgrams without any degradation in performance. Experimental results show that our system reduces the complexity and when combined with the baseline LVCSR based KWS system, it improves the performance both for the out-of-vocabulary (OOV) queries and the in-vocabulary (IV) queries.
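
The pre-filtering idea can be sketched as follows (a simplified stand-in with synthetic posteriorgrams and an assumed cosine-similarity filter): each query and candidate window is reduced to an averaged posteriorgram vector, a cheap vector-space similarity keeps only the top fraction of windows, and the quadratic DTW alignment is computed only for those survivors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dtw_cost(q, x):
    """Plain O(len(q) * len(x)) DTW with per-frame Euclidean cost."""
    Q, X = len(q), len(x)
    D = np.full((Q + 1, X + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, X + 1):
            c = np.linalg.norm(q[i - 1] - x[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, X]

def prefiltered_search(query, windows, keep=0.2):
    """Run full DTW only on windows whose averaged posteriorgram is most similar to the query's."""
    q_vec = query.mean(axis=0)
    sims = [cosine(q_vec, w.mean(axis=0)) for w in windows]
    n_keep = max(1, int(keep * len(windows)))
    shortlist = np.argsort(sims)[::-1][:n_keep]          # cheap vector-space pre-filter
    return sorted((dtw_cost(query, windows[i]), int(i)) for i in shortlist)

rng = np.random.default_rng(3)
query = rng.random((20, 40))                             # query posteriorgram (frames x phones)
windows = [rng.random((25, 40)) for _ in range(50)]      # candidate audio windows
print(prefiltered_search(query, windows)[:3])            # best (DTW cost, window index) pairs
```
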
Citations: 3
A factor analysis model of sequences for language recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846287
M. Omar
Joint factor analysis [1] application to speaker and language recognition advanced the performance of automatic systems in these areas. A special case of the early work in [1], namely the i-vector representation [2], has been applied successfully in many areas including speaker [2], language [3], and speech recognition [4]. This work presents a novel model which represents a long sequence of observations using the factor analysis model of shorter overlapping subsequences. This model takes into consideration the dependency of the adjacent latent vectors. It is shown that this model outperforms the current joint factor analysis approach based on the assumption of independent and identically distributed (iid) observations given one global latent vector. In addition, we replace the language-independent prior model of the latent vector in the i-vector model with a language-dependent prior model and modify the objective function used in the estimation of the factor analysis projection matrix and the prior model to correspond to the cross-entropy objective function estimated based on this new model. We also derive the update equations of the projection matrix and the prior model parameters which maximize the cross-entropy objective function. We evaluate the performance of our approach on the language recognition task of the robust automatic transcription of speech (RATS) project. Our experiments show relative improvements of up to 11% in terms of equal error rate using the proposed approach compared to the standard approach of using an i-vector representation [2].
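
For context, the standard i-vector factor analysis model that this work generalises (the special case cited as [2] above) can be written as below; this background equation is not taken from the paper itself, which instead assigns latent vectors to shorter overlapping subsequences and models the dependency between adjacent vectors.

```latex
% Background: i-vector factor analysis model (the special case cited as [2]).
% The utterance-dependent GMM mean supervector M(u) is modelled with one global
% latent vector w(u); the posterior mean of w(u) is the i-vector.
%   m : UBM mean supervector,  T : low-rank total-variability matrix.
M(u) = m + T\,w(u), \qquad w(u) \sim \mathcal{N}(0, I)
```
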
Citations: 1
Development of the MIT ASR system for the 2016 Arabic Multi-genre Broadcast Challenge
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846280
T. A. Hanai, Wei-Ning Hsu, James R. Glass
The Arabic language, with over 300 million speakers, has significant diversity and breadth. This proves challenging when building an automated system to understand what is said. This paper describes an Arabic Automatic Speech Recognition system developed on a 1,200 hour speech corpus that was made available for the 2016 Arabic Multi-genre Broadcast (MGB) Challenge. A range of Deep Neural Network (DNN) topologies were modeled, including feed-forward, convolutional, time-delay, recurrent long short-term memory (LSTM), highway LSTM (H-LSTM), and grid LSTM (G-LSTM) networks. The best performance came from a sequence discriminatively trained G-LSTM neural network. The best overall Word Error Rate (WER) was 18.3% (p < 0.001) on the development set, after combining hypotheses of 3- and 5-layer sequence discriminatively trained G-LSTM models that had been rescored with a 4-gram language model.
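
Hypothesis combination after language-model rescoring can be illustrated with a toy sketch: a log-linear combination over the union of two systems' n-best scores, with invented texts, scores, back-off rule, and weight; the actual combination used in the paper is more involved.

```python
def combine_hypotheses(system_a, system_b, weight_a=0.5):
    """Log-linear combination of two systems' rescored n-best lists.

    system_a / system_b -- dicts mapping hypothesis text to a (rescored) log-score
    weight_a            -- interpolation weight for system A
    A hypothesis missing from one system backs off to that system's worst score.
    """
    floor_a, floor_b = min(system_a.values()), min(system_b.values())
    combined = {}
    for text in set(system_a) | set(system_b):
        a = system_a.get(text, floor_a)
        b = system_b.get(text, floor_b)
        combined[text] = weight_a * a + (1 - weight_a) * b
    return max(combined, key=combined.get)

glstm_3layer = {"the quick brown fox": -35.1, "a quick brown fox": -36.0}
glstm_5layer = {"the quick brown fox": -34.7, "the quick brown box": -35.5}
print(combine_hypotheses(glstm_3layer, glstm_5layer))
```
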
Citations: 15
Syntax or semantics? knowledge-guided joint semantic frame parsing
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846288
Yun-Nung (Vivian) Chen, Dilek Hakanni-Tur, Gökhan Tür, Asli Celikyilmaz, Jianfeng Gao, L. Deng
Spoken language understanding (SLU) is a core component of a spoken dialogue system; it involves intent prediction and slot filling and is also called semantic frame parsing. Recently, recurrent neural networks (RNNs) have obtained strong results on SLU due to their superior ability to preserve sequential information over time. Traditionally, the SLU component parses semantic frames for utterances considering their flat structures, as the underlying RNN structure is a linear chain. However, natural language exhibits linguistic properties that provide rich, structured information for better understanding. This paper proposes to apply knowledge-guided structural attention networks (K-SAN), which additionally incorporate non-flat network topologies guided by prior knowledge, to a language understanding task. With an attention mechanism that utilizes two types of knowledge, syntax and semantics, the model can effectively figure out the salient substructures that are essential to parse the given utterance into its semantic frame. The experiments on the benchmark Air Travel Information System (ATIS) data and the conversational assistant Cortana data show that 1) the proposed K-SAN models with syntax or semantics outperform the state-of-the-art neural-network-based results, and 2) the improvement for joint semantic frame parsing is more significant, because the structured information provides rich cues for sentence-level understanding, where intent prediction and slot filling can be mutually improved.
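
The knowledge-guided attention idea can be sketched in a few lines: generic dot-product attention over substructure encodings (e.g. dependency subtrees supplied by prior knowledge), whose weighted summary enriches the flat sentence representation before joint intent and slot prediction. The encodings below are random placeholders, not the authors' K-SAN architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_guided_representation(sentence_vec, substructure_vecs):
    """Attend over knowledge-derived substructure vectors and add the attended
    summary to the flat sentence representation."""
    scores = substructure_vecs @ sentence_vec          # relevance of each substructure
    weights = softmax(scores)
    knowledge_vec = weights @ substructure_vecs
    return sentence_vec + knowledge_vec, weights

D, K = 8, 4                                            # hidden size, number of substructures
rng = np.random.default_rng(4)
sent = rng.normal(size=D)                              # utterance encoding (e.g. from an RNN)
subs = rng.normal(size=(K, D))                         # encodings of knowledge-guided substructures
rep, att = knowledge_guided_representation(sent, subs)
# The enriched representation would feed both the intent classifier and the slot tagger.
print("substructure attention:", np.round(att, 3))
```
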
Citations: 49