
Latest publications from the 2016 IEEE Spoken Language Technology Workshop (SLT)

Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification
Pub Date : 2016-12-13 DOI: 10.1109/SLT.2016.7846262
H. Delgado, M. Todisco, Md. Sahidullah, A. K. Sarkar, N. Evans, T. Kinnunen, Z. Tan
Many authentication applications involving automatic speaker verification (ASV) demand robust performance using short-duration, fixed or prompted text utterances. Text constraints not only reduce the phone-mismatch between enrolment and test utterances, which generally leads to improved performance, but also provide an ancillary level of security. This can take the form of explicit utterance verification (UV). An integrated UV + ASV system should then verify access attempts which contain not just the expected speaker, but also the expected text content. This paper presents such a system and introduces new features which are used for both UV and ASV tasks. Based upon multi-resolution, spectro-temporal analysis and when fused with more traditional parameterisations, the new features not only generally outperform Mel-frequency cepstral coefficients, but also are shown to be complementary when fusing systems at score level. Finally, the joint operation of UV and ASV greatly decreases false acceptances for unmatched text trials.
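As a concrete illustration of the constant-Q front-end named in the title, here is a minimal Python sketch of constant-Q cepstral feature extraction, assuming librosa and scipy are available. The reference CQCC pipeline additionally resamples the log spectrum uniformly before the DCT; that step is omitted here.

```python
# Minimal sketch of constant-Q cepstral feature extraction, assuming
# librosa is available. The reference CQCC pipeline also uniformly
# resamples the log spectrum before the DCT; that step is omitted here.
import librosa
import numpy as np
from scipy.fftpack import dct

def cqcc_like(wav_path, n_coeffs=20):
    y, sr = librosa.load(wav_path, sr=16000)
    # Constant-Q transform: geometrically spaced frequency bins, with
    # higher spectral resolution at low frequencies than a linear STFT.
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=256,
                           n_bins=96, bins_per_octave=12)) ** 2
    log_spec = np.log(C + 1e-10)
    # Cepstral coefficients via a DCT along the frequency axis.
    return dct(log_spec, axis=0, norm='ortho')[:n_coeffs].T  # (frames, coeffs)
```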
Citations: 31
A study of speech distortion conditions in real scenarios for speech processing applications
Pub Date : 2016-12-13 DOI: 10.1109/SLT.2016.7846239
D. González, E. Vincent, J. Lara
The growing demand for robust speech processing applications able to operate in adverse scenarios calls for new evaluation protocols and datasets beyond artificial laboratory conditions. The characteristics of real data for a given scenario are rarely discussed in the literature. As a result, methods are often tested based on the authors' expertise and not always in scenarios with actual practical value. This paper aims to open this discussion by identifying some of the main problems with the data simulation or collection procedures used so far and summarizing the important characteristics of real scenarios to be taken into account, including the properties of reverberation, noise and the Lombard effect. Finally, we provide some preliminary guidelines for designing experimental setups and speech recognition experiments for proposal validation.
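For context, a hedged sketch of the common simulation procedure the paper critiques: convolving clean speech with a room impulse response and mixing in noise at a target SNR. The function and its length handling are illustrative assumptions; real recordings also exhibit effects such as Lombard speech that this kind of simulation cannot reproduce.

```python
# Illustrative sketch of simulated speech distortion: reverberation via
# RIR convolution plus additive noise at a chosen SNR. Assumes the
# noise array is at least as long as the speech.
import numpy as np
from scipy.signal import fftconvolve

def simulate(clean, rir, noise, snr_db):
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    noise = noise[:len(reverberant)]
    # Scale the noise so the mixture reaches the requested SNR.
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```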
Citations: 8
Learning dialogue dynamics with the method of moments
Pub Date : 2016-12-13 DOI: 10.1109/SLT.2016.7846251
M. Barlier, R. Laroche, O. Pietquin
In this paper, we introduce a novel framework to encode the dynamics of dialogues into a probabilistic graphical model. Traditionally, Hidden Markov Models (HMMs) would be used to address this problem, involving a first step of hand-crafting to build a dialogue model (e.g. defining potential hidden states) followed by applying expectation-maximisation (EM) algorithms to refine it. Recently, an alternative class of algorithms based on the Method of Moments (MoM) has proven successful in avoiding issues of EM-like algorithms such as convergence towards local optima, tractability issues, initialization issues or the lack of theoretical guarantees. In this work, we show that dialogues may be modeled by SP-RFA, a class of graphical models efficiently learnable within the MoM and directly usable in planning algorithms (such as reinforcement learning). Experiments are conducted on the Ubuntu corpus, where dialogues are considered as sequences of dialogue acts, represented via Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). We show that a MoM-based algorithm can learn a compact model of sequences of such acts.
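An illustrative sketch (not the SP-RFA learner itself) of the starting point of method-of-moments estimation: computing empirical low-order moments over sequences of discrete dialogue-act IDs, which a spectral method would then decompose instead of running EM.

```python
# Empirical first- and second-order moments over sequences of discrete
# dialogue-act IDs. A spectral MoM method would, e.g., take an SVD of
# P21 to recover low-rank latent-state structure without EM.
import numpy as np

def empirical_moments(sequences, n_acts):
    P1 = np.zeros(n_acts)                 # P(x_t = i)
    P21 = np.zeros((n_acts, n_acts))      # P(x_{t+1} = i, x_t = j)
    for seq in sequences:
        for t, a in enumerate(seq):
            P1[a] += 1
            if t + 1 < len(seq):
                P21[seq[t + 1], a] += 1
    return P1 / P1.sum(), P21 / P21.sum()
```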
Citations: 1
A nonparametric Bayesian approach for automatic discovery of a lexicon and acoustic units
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846247
A. Torbati, J. Picone
State-of-the-art speech recognition systems use context-dependent phonemes as acoustic units. However, these approaches do not work well for low-resourced languages where large amounts of training data or resources such as a lexicon are not available. For such languages, automatic discovery of acoustic units can be important. In this paper, we demonstrate the application of nonparametric Bayesian models to acoustic unit discovery. We show that the discovered units are linguistically meaningful. We also present a semi-supervised learning algorithm that uses a nonparametric Bayesian model to learn a mapping between words and acoustic units. We demonstrate that a speech recognition system using these discovered resources can approach the performance of a speech recognizer trained using resources developed by experts. We show that unsupervised discovery of acoustic units combined with semi-supervised discovery of the lexicon achieved performance (9.8% WER) comparable to other published high complexity systems. This nonparametric approach enables the rapid development of speech recognition systems in low-resourced languages.
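To illustrate the nonparametric Bayesian idea in isolation, here is a minimal sketch using scikit-learn's truncated Dirichlet-process Gaussian mixture, which lets the data decide how many candidate acoustic units remain active. The paper's actual model is more structured; this only shows the "number of units is inferred, not fixed" principle, with random data standing in for acoustic frames.

```python
# Minimal sketch of nonparametric acoustic unit discovery, assuming
# scikit-learn: a truncated Dirichlet-process Gaussian mixture keeps
# only as many of the candidate components (units) as the data support.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

frames = np.random.randn(5000, 13)  # stand-in for MFCC frames

dpgmm = BayesianGaussianMixture(
    n_components=50,  # truncation level: an upper bound on units
    weight_concentration_prior_type='dirichlet_process',
    covariance_type='diag', max_iter=200)
units = dpgmm.fit_predict(frames)            # unit label per frame
active = np.sum(dpgmm.weights_ > 1e-2)       # effective number of units
```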
Citations: 0
Look, listen, and decode: Multimodal speech recognition with images
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846320
Felix Sun, David F. Harwath, James R. Glass
In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
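A hedged sketch of the second integration point described above, rescoring an n-best list with an image-conditioned word-level RNN. `rnn_logprob` and the interpolation weight are illustrative assumptions, not the paper's implementation.

```python
# Sketch of n-best rescoring: combine each hypothesis's recognizer
# score with a word-level RNN LM score conditioned on the image.
# `rnn_logprob` is a hypothetical callable; the LM weight would be
# tuned on development data.
def rescore_nbest(nbest, image_vec, rnn_logprob, lm_weight=0.5):
    """nbest: list of (hypothesis_words, recognizer_score) pairs."""
    rescored = []
    for words, asr_score in nbest:
        lm_score = rnn_logprob(words, image_vec)  # log P(words | image)
        rescored.append((words, asr_score + lm_weight * lm_score))
    return max(rescored, key=lambda pair: pair[1])[0]
```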
Citations: 27
Weakly supervised user intent detection for multi-domain dialogues
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846250
Ming Sun, Aasish Pappu, Yun-Nung (Vivian) Chen, Alexander I. Rudnicky
Users interact with mobile apps with certain intents such as finding a restaurant. Some intents and their corresponding activities are complex and may involve multiple apps; for example, a restaurant app, a messenger app and a calendar app may be needed to plan a dinner with friends. However, activities may be quite personal and third-party developers would not be building apps to specifically handle complex intents (e.g., a DinnerPlanner). Instead we want our intelligent agent to actively learn to understand these intents and provide assistance when needed. This paper proposes a framework to enable the agent to learn an inventory of intents from a small set of task-oriented user utterances. The experiments show that on previously unseen user activities, the agent is able to reliably recognize user intents using graph-based semi-supervised learning methods. The dataset, models, and the system outputs are available to research community.
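To make the graph-based semi-supervised step concrete, a minimal sketch using scikit-learn's LabelSpreading, with random utterance embeddings and a small seed of labeled intents standing in for the paper's data.

```python
# Minimal sketch of graph-based semi-supervised intent labeling,
# assuming scikit-learn: a few utterances carry intent labels and
# labels spread over a k-NN similarity graph to the unlabeled ones.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.random.randn(200, 50)           # stand-in utterance embeddings
y = np.full(200, -1)                   # -1 marks unlabeled utterances
y[:10] = np.random.randint(0, 3, 10)   # small seed set of 3 intents

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)
predicted_intents = model.transduction_  # inferred labels for all 200
```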
Citations: 4
Voice search language model adaptation using contextual information
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846273
Justin Scheiner, Ian Williams, Petar S. Aleksic
It has been shown that automatic speech recognition (ASR) system quality can be improved by augmenting n-gram language models with contextual information [1][2]. In the voice search domain, there are a large number of useful contextual signals for a given query. Some of these signals are speaker location, speaker identity, time of the query, etc. Each of these signals comes with relevant contextual information (e.g. location specific entities, favorite queries, recent popular queries) that is not included in the language model's training data. We show that these contextual signals can be used to improve ASR system quality. This is achieved by adjusting n-gram language model probabilities on-the-fly based on the contextual information relevant for the current voice search request. We analyze three example sources of context: location context, previously entered typed and spoken queries. We present a set of approaches we have used to improve ASR quality using these sources of context. Our main objective is to automatically, in real time, take advantage of all available sources of contextual information. In addition, we investigate challenges that come with applying our approach to a number of languages (unsegmented languages, languages with diacritics) and present solutions used.
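A hypothetical sketch in the spirit of the on-the-fly adjustment described above: n-grams ending in a contextual entity receive a log-probability bonus at lookup time. Production decoders apply this inside the LM lookup itself; the bonus value here is an arbitrary placeholder.

```python
# Hypothetical on-the-fly n-gram biasing: boost the log-probability of
# n-grams whose final word matches a contextual entity (e.g., a nearby
# place name or a recent popular query term).
import math

def biased_logprob(base_logprob, ngram, context_entities,
                   bonus=math.log(10)):
    if ngram[-1] in context_entities:  # last word is a context entity
        return base_logprob + bonus    # boost by one order of magnitude
    return base_logprob
```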
Citations: 15
Quaternion Neural Networks for Spoken Language Understanding
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846290
Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, G. Linarès, R. Mori
Machine Learning (ML) techniques have enabled great performance improvements on various challenging Spoken Language Understanding (SLU) tasks. Among these methods, Neural Networks (NN), or Multilayer Perceptrons (MLP), have recently received great interest from researchers due to their capability of representing complex internal structures in a low-dimensional subspace. However, MLPs employ document representations based on basic word-level or topic-based features. These basic representations therefore reveal little in the way of document statistical structure, since they consider the words or topics contained in the document only as a "bag-of-words", ignoring relations between them. We propose to remedy this weakness by extending the complex features based on Quaternion algebra presented in [1] to neural networks, called QMLP. This original QMLP approach is based on hyper-complex algebra to take feature dependencies in documents into consideration. New document features based on the document structure itself, used as input to the QMLP, are also investigated in this paper and compared to those initially proposed in [1]. Experiments on an SLU task from a real framework of human spoken dialogues showed that our QMLP approach, associated with the proposed document features, outperforms other approaches, with an accuracy gain of 2% with respect to the MLP based on real numbers and more than 3% with respect to the first Quaternion-based features proposed in [1]. We finally demonstrate that our QMLP architecture needs fewer iterations to be efficient and to reach promising accuracies.
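For readers unfamiliar with quaternion networks, a minimal sketch of the Hamilton product, the algebraic building block a quaternion dense layer applies between quaternion-valued weights and inputs. The paper's full QMLP is not reproduced here.

```python
# Hamilton product of two quaternions q = r + xi + yj + zk, the core
# operation of a quaternion layer: the four components of weights and
# inputs interact through one shared set of parameters.
import numpy as np

def hamilton_product(q, p):
    r1, x1, y1, z1 = q
    r2, x2, y2, z2 = p
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,   # real part
        r1*x2 + x1*r2 + y1*z2 - z1*y2,   # i component
        r1*y2 - x1*z2 + y1*r2 + z1*x2,   # j component
        r1*z2 + x1*y2 - y1*x2 + z1*r2,   # k component
    ])
```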
Citations: 37
High quality agreement-based semi-supervised training data for acoustic modeling
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846323
F. D. C. Quitry, Asa Oines, P. Moreno, Eugene Weinstein
This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Furthermore, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets.
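A minimal sketch of the agreement filter, assuming each recognizer is a callable mapping audio to a transcript string (a hypothetical API): only utterances on which all recognizers agree are kept, together with their unanimous transcript.

```python
# Agreement-based selection: decode each utterance with several
# dialect-specific recognizers and keep it only if every recognizer
# produced the same (normalized) transcript.
def normalize(text):
    return ' '.join(text.lower().split())

def select_agreed(utterances, recognizers):
    """utterances: iterable of audio items; recognizers: list of
    callables mapping audio -> transcript string (hypothetical API)."""
    selected = []
    for utt in utterances:
        hyps = {normalize(rec(utt)) for rec in recognizers}
        if len(hyps) == 1:                    # unanimous agreement
            selected.append((utt, hyps.pop()))
    return selected
```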
Citations: 12
DNN adaptation for recognition of children speech through automatic utterance selection
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846331
M. Matassoni, D. Falavigna, D. Giuliani
This paper describes an approach for adapting a DNN trained on adult speech to children's voices. The method extends a previous one, based on the Kullback-Leibler divergence between the original (adult) DNN output distribution and the target one, by accounting for the quality of the supervision of the adaptation utterances. In addition, starting from the observation that significant performance improvements can be achieved by gradually removing the sentences with higher WERs from the adaptation set, we also investigate automatic selection of adaptation utterances. For determining transcription quality, we investigate the use of confidence estimates of recognized hypotheses. We present experiments and related results achieved on an Italian data set of children's speech. We show that the proposed DNN adaptation approach allows the WER on a given test set to be significantly reduced from 14.2% (obtained with the non-adapted DNN trained on adult speech) to 10.6%. It is worth mentioning that the latter result has been achieved without making use of any training data specific to children's speech.
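A hedged PyTorch sketch of the KL-regularized adaptation loss the method builds on: the training target interpolates the hard label with the original adult-trained DNN's posterior, which is equivalent, up to a constant, to adding a KL term between the adapted and original output distributions.

```python
# KL-regularized adaptation loss: interpolate the hard child-speech
# label with the frozen adult DNN's posterior, then train with
# cross-entropy against the interpolated target.
import torch
import torch.nn.functional as F

def kl_adapt_loss(adapted_logits, adult_logits, hard_labels, rho=0.5):
    with torch.no_grad():
        adult_post = F.softmax(adult_logits, dim=-1)  # frozen reference
    one_hot = F.one_hot(hard_labels, adapted_logits.size(-1)).float()
    target = (1 - rho) * one_hot + rho * adult_post   # interpolated target
    log_probs = F.log_softmax(adapted_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()
```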
Citations: 6