
Latest publications from the 2016 IEEE Spoken Language Technology Workshop (SLT)

Intent detection using semantically enriched word embeddings
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846297
Joo-Kyung Kim, Gökhan Tür, Asli Celikyilmaz, Bin Cao, Ye-Yi Wang
State-of-the-art targeted language understanding systems rely on deep learning methods using 1-hot word vectors or off-the-shelf word embeddings. While word embeddings can be enriched with information from semantic lexicons (such as WordNet and PPDB) to improve their semantic representation, most previous research on word-embedding enrichment has focused on improving intrinsic word-level tasks such as word analogy and antonym detection. In this work, we enrich word embeddings to force semantically similar or dissimilar words to be closer or farther away in the embedding space to improve the performance of an extrinsic task, namely, intent detection for spoken language understanding. We utilize several semantic lexicons, such as WordNet, PPDB, and Macmillan Dictionary, to enrich the word embeddings and later use them as the initial representation of words for intent detection. Thus, we enrich embeddings outside the neural network, as opposed to learning the embeddings within the network, and, on top of the embeddings, build a bidirectional LSTM for intent detection. Our experiments on ATIS and a real log dataset from Microsoft Cortana show that word embeddings enriched with semantic lexicons can improve intent detection.
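The enrichment described above resembles lexicon-driven retrofitting: iteratively pulling each word vector toward its lexicon neighbours while staying close to the pre-trained vector. A minimal sketch of that idea, with toy vectors and a hypothetical synonym lexicon (not the paper's actual data or code):

```python
import numpy as np

def retrofit(embeddings, synonyms, alpha=1.0, beta=1.0, iters=10):
    """Pull each word's vector toward its lexicon neighbours.

    embeddings: dict word -> np.array (pre-trained vectors)
    synonyms:   dict word -> list of semantically similar words
    Each update is the closed-form minimizer of
      alpha*||q_w - q_w^0||^2 + beta*sum_n ||q_w - q_n||^2
    for one word w with its current neighbour vectors held fixed.
    """
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iters):
        for w, nbrs in synonyms.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            # weighted average of the original vector and neighbour vectors
            num = alpha * embeddings[w] + beta * sum(new[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
    return new

# toy 2-d vectors; "cheap"/"inexpensive" are synonyms, "flight" is unrelated
emb = {"cheap": np.array([1.0, 0.0]),
       "inexpensive": np.array([0.0, 1.0]),
       "flight": np.array([5.0, 5.0])}
lex = {"cheap": ["inexpensive"], "inexpensive": ["cheap"]}
enriched = retrofit(emb, lex)
# synonyms end up closer together; words absent from the lexicon are untouched
```

The enriched vectors would then seed the LSTM's embedding layer in place of the raw pre-trained ones.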
Citations: 87
Deep neural network-based speaker embeddings for end-to-end speaker verification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846260
David Snyder, Pegah Ghahremani, Daniel Povey, D. Garcia-Romero, Yishay Carmiel, S. Khudanpur
In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable-length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that this is unexplored for the text-independent task. We show that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% on average and by 29% when pooled across test conditions. The fused system achieves a reduction of 32% on average and 38% pooled.
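EER, the headline metric above, can be computed from same-speaker and different-speaker scores by sweeping a threshold until the miss rate and false-alarm rate meet. A minimal sketch with made-up scores (real evaluations interpolate between operating points):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep a threshold over the observed scores and return the operating
    point where the miss rate and false-alarm rate are closest (their mean)."""
    best = (1.0, None)  # (|miss - fa|, candidate EER)
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        miss = np.mean([s < t for s in target_scores])    # targets rejected
        fa = np.mean([s >= t for s in nontarget_scores])  # impostors accepted
        if abs(miss - fa) < best[0]:
            best = (abs(miss - fa), (miss + fa) / 2)
    return best[1]

# toy scores: same-speaker pairs mostly high, different-speaker pairs mostly low
tgt = [0.9, 0.8, 0.85, 0.7]
non = [0.2, 0.3, 0.4, 0.75]
eer = equal_error_rate(tgt, non)  # one target and one impostor error -> 0.25
```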
Citations: 332
Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846306
Ming Sun, A. Raju, G. Tucker, S. Panchapagesan, Gengshen Fu, Arindam Mandal, S. Matsoukas, N. Strom, S. Vitaladevuni
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with a randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields a 67.6% relative reduction compared to the baseline feed-forward DNN in the Area Under the Curve (AUC) measure.
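The max-pooling idea can be sketched outside any particular toolkit: for a keyword segment, only the frame with the highest keyword posterior contributes to the cross-entropy loss, while background frames are all penalized. A toy sketch with hypothetical per-frame posteriors:

```python
import numpy as np

def max_pooling_loss(posteriors, is_keyword):
    """posteriors: (T,) per-frame keyword posteriors from the network.
    Keyword segments: back-propagate only through the single frame with the
    highest posterior (max-pool). Background: penalise every frame."""
    eps = 1e-12
    if is_keyword:
        p = posteriors.max()                      # max-pool over the segment
        return -np.log(p + eps)
    return float(-np.log(1.0 - posteriors + eps).sum())

# a segment where one frame is confidently the keyword incurs low loss;
# a segment with no confident frame incurs high loss
confident = max_pooling_loss(np.array([0.1, 0.2, 0.9]), is_keyword=True)
uncertain = max_pooling_loss(np.array([0.1, 0.2, 0.3]), is_keyword=True)
```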
Citations: 109
Phonetic content impact on Forensic Voice Comparison
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846267
Ajili. Moez, Bonastre Jean-François, Ben Kheder Waad, Rossato Solange, Kahn Juliette
Forensic Voice Comparison (FVC) increasingly uses the likelihood ratio (LR) to indicate whether the evidence supports the prosecution (same-speaker) or defense (different-speakers) hypothesis. In addition to supporting one hypothesis, the LR provides a theoretically founded estimate of the relative strength of that support. Despite this appealing theoretical grounding, the LR is subject to practical limitations, due both to the estimation process itself and to a lack of knowledge about the reliability of this (practical) estimation process. In a large set of situations, a lack of reliability at the estimation-process level can destroy the reliability of the resulting LR. This is particularly true when automatic FVC is considered, as Automatic Speaker Recognition (ASpR) systems output a score in all situations, regardless of case-specific conditions. Furthermore, ASpR systems use different normalization steps to interpret their scores as LRs, and these normalization steps are potential sources of bias. In the LR estimation done by ASpR systems, several factors are not taken into account, such as the amount of information involved in the comparison, the phonemic content, and the speaker's intrinsic characteristics, denoted here the "speaker factor". Consequently, a more complete view of reliability seems to be mandatory for FVC, even if an LR-like approach is used. This article focuses on the impact of phonemic content on FVC performance and variability. The experimental part uses the FABIOLE database, which is dedicated to this kind of study and allows examination of both inter-speaker and intra-speaker variability. The results demonstrate the importance of the phonemic content and highlight interesting differences between inter-speaker and intra-speaker effects.
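The score-to-LR step that the paper critiques can be illustrated with a simple Gaussian score model: the LR is the ratio of the score's density under the same-speaker and different-speakers hypotheses. The distribution parameters below are invented for illustration; in practice they are estimated from background data:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(score, same_mu, same_sigma, diff_mu, diff_sigma):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return gaussian_pdf(score, same_mu, same_sigma) / gaussian_pdf(score, diff_mu, diff_sigma)

# hypothetical score distributions: same-speaker scores centre on 3,
# different-speaker scores centre on 0
lr_high = likelihood_ratio(2.5, same_mu=3.0, same_sigma=1.0, diff_mu=0.0, diff_sigma=1.0)
lr_low = likelihood_ratio(0.2, same_mu=3.0, same_sigma=1.0, diff_mu=0.0, diff_sigma=1.0)
# lr_high > 1 supports the prosecution hypothesis; lr_low < 1 supports the defense
```

Note that nothing in this mapping reflects the phonemic content or duration of the compared material, which is precisely the reliability gap the paper examines.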
Citations: 25
Deep neural network based acoustic model parameter reduction using manifold regularized low rank matrix factorization
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846333
Hoon Chung, Jeom-ja Kang, Kiyoung Park, Sung Joo Lee, J. Park
In this paper, we propose a deep neural network (DNN) model parameter reduction method based on manifold regularized low-rank matrix factorization, to reduce the computational complexity of acoustic models for low-resource embedded devices. One of the most common DNN model parameter reduction techniques is truncated singular value decomposition (TSVD). TSVD reduces the number of parameters by approximating a target matrix with a low-rank one in terms of minimizing the Euclidean norm. In this work, we question whether the Euclidean norm is appropriate as the objective function for factorizing DNN matrices, because DNNs are known to learn a nonlinear manifold of acoustic features. Therefore, in order to exploit the manifold structure for robust parameter reduction, we propose a manifold regularized matrix factorization approach. The proposed method was evaluated on the TIMIT phone recognition domain.
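The TSVD baseline the paper starts from can be sketched directly: factor a weight matrix W into two thin matrices whose product minimizes the Frobenius (Euclidean) error at the chosen rank, trading a little accuracy for far fewer parameters:

```python
import numpy as np

def truncated_svd_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A: (m, rank) and B: (rank, n).
    By the Eckart-Young theorem this is the best rank-`rank` approximation
    of W in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold the top singular values into A
    B = Vt[:rank, :]
    return A, B

# a 512x512 layer replaced by two 512x64 factors: 4x fewer parameters
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
A, B = truncated_svd_factorize(W, rank=64)
params_before = W.size            # 262144
params_after = A.size + B.size    # 65536
```

The paper's contribution is to replace the pure Euclidean objective in this factorization with one that also respects the manifold structure of the features.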
Citations: 3
Evaluation and calibration of Lombard effects in speaker verification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846266
Finnian Kelly, J. Hansen
The Lombard effect is the involuntary tendency of speakers to increase their vocal effort in noisy environments in order to maintain intelligible communication. This study assesses the impact of the Lombard effect on the performance of a current speaker verification system. Lombard speech produced in the presence of several noise types and noise levels is drawn from the UT-Scope corpus. The performance of an i-vector PLDA (Probabilistic Linear Discriminant Analysis) system is observed to degrade significantly with Lombard speech. The resulting error rates are found to be dependent on the noise type and noise level. A score calibration scheme based on Quality Measure Functions (QMFs) is adopted, allowing noise information to be incorporated into calibration. This approach leads to a reduction in discrimination error relative to conventional calibration.
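QMF-based calibration can be sketched as logistic regression on the raw verification score plus a quality measure (here, a hypothetical normalized-SNR feature); the weights then map each trial to a calibrated log-LR that accounts for its noise condition. The training data below are invented for illustration:

```python
import numpy as np

def fit_qmf_calibration(scores, quality, labels, lr=0.5, iters=2000):
    """Learn calibrated log-LR = w0 + w1*score + w2*quality by logistic
    regression (batch gradient ascent); the quality term is the QMF."""
    X = np.column_stack([np.ones_like(scores), scores, quality])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(3)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid of current log-LR
        w += lr * X.T @ (y - p) / len(y)   # gradient of the log-likelihood
    return w

# toy trials: target scores are depressed in low-quality (noisy) conditions
scores = np.array([2.0, 0.4, 1.8, 0.2, -0.5, -2.0, -0.8, -2.2])
quality = np.array([1.0, 0.2, 1.0, 0.2, 1.0, 0.2, 1.0, 0.2])  # hypothetical SNR/20
labels = [1, 1, 1, 1, 0, 0, 0, 0]
w = fit_qmf_calibration(scores, quality, labels)
```

With the quality feature available at calibration time, noisy trials are no longer forced through the same score-to-LR mapping as clean ones.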
Citations: 6
An overview of end-to-end language understanding and dialog management for personal digital assistants
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846294
R. Sarikaya, Paul A. Crook, Alex Marin, Minwoo Jeong, J. Robichaud, Asli Celikyilmaz, Young-Bum Kim, Alexandre Rochette, O. Khan, Xiaohu Liu, D. Boies, T. Anastasakos, Zhaleh Feizollahi, Nikhil Ramesh, H. Suzuki, R. Holenstein, E. Krawczyk, Vasiliy Radostev
Spoken language understanding and dialog management have emerged as key technologies in interacting with personal digital assistants (PDAs). The coverage, complexity, and the scale of PDAs are much larger than previous conversational understanding systems. As such, new problems arise. In this paper, we provide an overview of the language understanding and dialog management capabilities of PDAs, focusing particularly on Cortana, Microsoft's PDA. We explain the system architecture for language understanding and dialog management for our PDA, indicate how it differs from prior state-of-the-art systems, and describe key components. We also report a set of experiments detailing system performance on a variety of scenarios and tasks. We describe how the quality of user experiences are measured end-to-end and also discuss open issues.
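As a rough illustration of the language-understanding-to-dialog-management flow described above (not Cortana's actual architecture), a toy rule-based pipeline that classifies an intent, fills slots, and carries state across turns; all names and rules are hypothetical:

```python
def understand(utterance):
    """Toy LU step: map an utterance to an intent and a slot dictionary
    (a real PDA uses statistical intent and slot models)."""
    text = utterance.lower()
    if "weather" in text:
        intent = "get_weather"
    elif "alarm" in text:
        intent = "set_alarm"
    else:
        intent = "unknown"
    slots = {}
    for city in ("seattle", "boston"):
        if city in text:
            slots["place"] = city
    return intent, slots

def update_state(state, intent, slots):
    """Toy DM step: keep the last confident intent and accumulate slots,
    so later turns can fill in what earlier turns left out."""
    if intent != "unknown":
        state["intent"] = intent
    state.setdefault("slots", {}).update(slots)
    return state

state = {}
for turn in ("What's the weather like?", "In Seattle please"):
    intent, slots = understand(turn)
    state = update_state(state, intent, slots)
# after both turns the state combines the first turn's intent
# with the second turn's place slot
```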
Citations: 59
Speech vs. text: A comparative analysis of features for depression detection systems
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846256
M. Morales, Rivka Levitan
Depression is a serious illness that affects millions of people globally. In recent years, the task of automatic depression detection from speech has gained popularity. However, several challenges remain, including which features provide the best discrimination between classes or depression levels. Thus far, most research has focused on extracting features from the speech signal. However, the speech production system is complex and depression has been shown to affect many linguistic properties, including phonetics, semantics, and syntax. Therefore, we argue that researchers should look beyond the acoustic properties of speech by building features that capture syntactic structure and semantic content. We provide a comparative analysis of various features for depression detection. Using the same corpus, we evaluate how a system built on text-based features compares to a speech-based system. We find that a combination of features drawn from both speech and text leads to the best system performance.
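Early fusion of speech- and text-based features, the combination the paper finds strongest, can be sketched as simple concatenation of the two feature vectors before classification. The specific features below are illustrative stand-ins, not the paper's feature set:

```python
import numpy as np

def acoustic_features(f0_track):
    """Toy prosodic features: mean and variability of the pitch track."""
    return np.array([np.mean(f0_track), np.std(f0_track)])

def text_features(transcript):
    """Toy lexical features: utterance length and first-person pronoun
    rate (chosen here purely for illustration)."""
    words = transcript.lower().split()
    first_person = sum(w in ("i", "me", "my") for w in words)
    return np.array([len(words), first_person / max(len(words), 1)])

def combined_features(f0_track, transcript):
    # early fusion: concatenate speech- and text-based features into
    # one vector for a downstream classifier
    return np.concatenate([acoustic_features(f0_track), text_features(transcript)])

feats = combined_features([110.0, 112.0, 108.0], "I feel tired all the time")
```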
Citations: 57
Extractive speech summarization leveraging convolutional neural network techniques
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846259
Chun-I Tsai, Hsiao-Tsung Hung, Kuan-Yu Chen, Berlin Chen
Extractive text or speech summarization endeavors to select representative sentences from a source document and assemble them into a concise summary, so as to help people to browse and assimilate the main theme of the document efficiently. The recent past has seen a surge of interest in developing deep learning- or deep neural network-based supervised methods for extractive text summarization. This paper presents a continuation of this line of research for speech summarization and its contributions are three-fold. First, we exploit an effective framework that integrates two convolutional neural networks (CNNs) and a multilayer perceptron (MLP) for summary sentence selection. Specifically, CNNs encode a given document-sentence pair into two discriminative vector embeddings separately, while MLP in turn takes the two embeddings of a document-sentence pair and their similarity measure as the input to induce a ranking score for each sentence. Second, the input of MLP is augmented by a rich set of prosodic and lexical features apart from those derived from CNNs. Third, the utility of our proposed summarization methods and several widely-used methods are extensively analyzed and compared. The empirical results seem to demonstrate the effectiveness of our summarization method in relation to several state-of-the-art methods.
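The pair-scoring interface described above, an MLP over the two embeddings plus their similarity, can be sketched with fixed, hand-set weights (a trained model would learn them; here the weights simply pass through the similarity term so the ranking is easy to verify):

```python
import numpy as np

def mlp_score(doc_vec, sent_vec, W1, b1, w2, b2):
    """Score one document-sentence pair: the MLP input is the two
    embeddings concatenated with their cosine similarity."""
    sim = doc_vec @ sent_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(sent_vec))
    x = np.concatenate([doc_vec, sent_vec, [sim]])
    h = np.tanh(W1 @ x + b1)
    return float(w2 @ h + b2)

def rank_sentences(doc_vec, sent_vecs, params):
    """Return sentence indices sorted by descending MLP score."""
    scores = [mlp_score(doc_vec, s, *params) for s in sent_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# hand-set weights that only look at the similarity input (illustrative):
# with 2-d embeddings, x has 5 entries and the last one is the similarity
params = (np.array([[0.0, 0.0, 0.0, 0.0, 1.0]]),  # W1
          np.zeros(1),                             # b1
          np.array([1.0]),                         # w2
          0.0)                                     # b2
doc = np.array([1.0, 1.0])
sents = [np.array([1.0, 0.9]),    # close to the document theme
         np.array([-1.0, 1.0])]   # orthogonal to it
order = rank_sentences(doc, sents, params)  # the on-theme sentence ranks first
```

In the paper, the embeddings themselves come from the two CNN encoders rather than being given directly, and the MLP input is further augmented with prosodic and lexical features.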
Citations: 13
Analysis of user behavior with multimodal virtual customer service agents
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846322
Ian Beaver, Cynthia Freeman
We investigate the occurrence of user restatement when there is no apparent error in Intelligent Virtual Assistant (IVA) understanding in a multimodal customer service environment. We define several classes of response-medium combinations and use various statistical measures to determine how the combination of medium and linguistic complexity impacts the user's apparent willingness to accept their query result. Through analysis of 3,000 sessions with a live customer service IVA deployed on an airline company's website and mobile application, we discover that as more media are involved in a response, user restatements increase. We also determine which linguistic complexity measures should be minimized for each response class in order to improve user comprehension.
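The per-class analysis described in this abstract can be illustrated with a toy restatement-rate computation. The session rows below are fabricated for the sketch; only the general procedure (group sessions by response-medium class, then compare restatement rates) mirrors the study.

```python
from collections import defaultdict

# Fabricated session log: (response-medium class, did the user restate?).
# The real study analyzed 3,000 live sessions; these rows are invented.
sessions = [
    ("text", False), ("text", False), ("text", True),
    ("text+image", True), ("text+image", False),
    ("text+image+link", True), ("text+image+link", True),
]

counts = defaultdict(lambda: [0, 0])        # class -> [restatements, total]
for medium, restated in sessions:
    counts[medium][0] += int(restated)
    counts[medium][1] += 1

# Restatement rate per response-medium class: a proxy for how often
# users were unwilling to accept the first answer.
restate_rate = {m: r / n for m, (r, n) in counts.items()}

# Order classes by how many media the response involved; the paper's
# finding is that restatements rise as this count grows.
by_media = sorted(restate_rate.items(), key=lambda kv: kv[0].count("+"))
```

With real logs, the same grouping would be paired with significance tests and per-class linguistic-complexity measures rather than raw rates alone.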
{"title":"Analysis of user behavior with multimodal virtual customer service agents","authors":"Ian Beaver, Cynthia Freeman","doi":"10.1109/SLT.2016.7846322","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846322","url":null,"abstract":"We investigate the occurrence of user restatement when there is no apparent error in Intelligent Virtual Assistant (IVA) understanding in a multimodal customer service environment. We define several classes of response-medium combinations and use various statistical measures to determine how the combination of medium and linguistic complexity impacts the user's apparent willingness to accept their query result. Through analysis of 3,000 sessions with a live customer service IVA deployed on an airline company's website and mobile application, we discover that as more media are involved in a response, user restatements increase. We also determine which linguistic complexity measures should be minimized for each response class in order to improve user comprehension.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123215866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1