Intent detection using semantically enriched word embeddings
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846297
Joo-Kyung Kim, Gökhan Tür, Asli Celikyilmaz, Bin Cao, Ye-Yi Wang
State-of-the-art targeted language understanding systems rely on deep learning methods using one-hot word vectors or off-the-shelf word embeddings. While word embeddings can be enriched with information from semantic lexicons (such as WordNet and PPDB) to improve their semantic representation, most previous research on word-embedding enrichment has focused on intrinsic word-level tasks such as word analogy and antonym detection. In this work, we enrich word embeddings to force semantically similar or dissimilar words to be closer together or farther apart in the embedding space, in order to improve the performance of an extrinsic task, namely intent detection for spoken language understanding. We utilize several semantic lexicons (WordNet, PPDB, and the Macmillan Dictionary) to enrich the word embeddings and later use them as the initial representation of words for intent detection. Thus, we enrich the embeddings outside the neural network, as opposed to learning them within the network, and build a bidirectional LSTM on top of the embeddings for intent detection. Our experiments on ATIS and a real log dataset from Microsoft Cortana show that word embeddings enriched with semantic lexicons can improve intent detection.
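The enrichment step described above happens outside the network; one common way to realize it is a retrofitting-style update that pulls a word vector toward its lexicon neighbors. The sketch below is a minimal NumPy illustration of that idea under assumed toy vectors and a toy synonym list, not the authors' exact procedure or hyperparameters.

```python
import numpy as np

def enrich_embeddings(vectors, synonyms, alpha=1.0, beta=1.0, iters=10):
    """Pull each word vector toward its lexicon synonyms (retrofitting-style).

    vectors:  dict word -> np.array (pre-trained embedding)
    synonyms: dict word -> list of words listed as similar in a lexicon
    """
    enriched = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for w, neighbors in synonyms.items():
            neighbors = [n for n in neighbors if n in enriched]
            if w not in enriched or not neighbors:
                continue
            # Weighted average of the original vector and its lexicon neighbors.
            update = alpha * vectors[w]
            update += beta * sum(enriched[n] for n in neighbors)
            enriched[w] = update / (alpha + beta * len(neighbors))
    return enriched

# Toy usage with hypothetical 3-d vectors and a tiny synonym list.
vecs = {"cheap": np.array([0.1, 0.2, 0.3]),
        "inexpensive": np.array([0.3, 0.1, 0.2]),
        "flight": np.array([0.9, 0.8, 0.1])}
lex = {"cheap": ["inexpensive"], "inexpensive": ["cheap"]}
enriched = enrich_embeddings(vecs, lex)
```

The enriched vectors would then initialize the embedding layer of the intent-detection network instead of the original pre-trained vectors.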
{"title":"Intent detection using semantically enriched word embeddings","authors":"Joo-Kyung Kim, Gökhan Tür, Asli Celikyilmaz, Bin Cao, Ye-Yi Wang","doi":"10.1109/SLT.2016.7846297","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846297","url":null,"abstract":"State-of-the-art targeted language understanding systems rely on deep learning methods using 1-hot word vectors or off-the-shelf word embeddings. While word embeddings can be enriched with information from semantic lexicons (such as WordNet and PPDB) to improve their semantic representation, most previous research on word-embedding enriching has focused on improving intrinsic word-level tasks such as word analogy and antonym detection. In this work, we enrich word embeddings to force semantically similar or dissimilar words to be closer or farther away in the embedding space to improve the performance of an extrinsic task, namely, intent detection for spoken language understanding. We utilize several semantic lexicons, such as WordNet, PPDB, and Macmillan Dictionary to enrich the word embeddings and later use them as initial representation of words for intent detection. Thus, we enrich embeddings outside the neural network as opposed to learning the embeddings within the network, and, on top of the embeddings, build bidirectional LSTM for intent detection. Our experiments on ATIS and a real log dataset from Microsoft Cortana show that word embeddings enriched with semantic lexicons can improve intent detection.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129445404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural network-based speaker embeddings for end-to-end speaker verification
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846260
David Snyder, Pegah Ghahremani, Daniel Povey, D. Garcia-Romero, Yishay Carmiel, S. Khudanpur
In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable-length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe this approach remains unexplored for the text-independent task. We show that, given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% on average and by 29% pooled across test conditions. The fused system achieves a reduction of 32% on average and 38% pooled.
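For reference, the equal error rate (EER) is the operating point where the false-acceptance rate and the miss rate are equal. The sketch below computes it from raw trial scores by sweeping thresholds; it is a generic illustration over assumed score and label arrays, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from verification scores; labels are 1 (same speaker) / 0 (different)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # false-acceptance rate
        frr = np.mean(~accept[labels == 1])   # miss (false-rejection) rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy usage with hypothetical scores.
print(equal_error_rate([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0]))
```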
{"title":"Deep neural network-based speaker embeddings for end-to-end speaker verification","authors":"David Snyder, Pegah Ghahremani, Daniel Povey, D. Garcia-Romero, Yishay Carmiel, S. Khudanpur","doi":"10.1109/SLT.2016.7846260","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846260","url":null,"abstract":"In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that this is unexplored for the text-independent task. We show that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% average and 29% pooled across test conditions. The fused system achieves a reduction of 32% average and 38% pooled.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129879564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846306
Ming Sun, A. Raju, G. Tucker, S. Panchapagesan, Gengshen Fu, Arindam Mandal, S. Matsoukas, N. Strom, S. Vitaladevuni
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, a max-pooling loss trained LSTM with a randomly initialized network performs better than a cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a 67.6% relative reduction in the Area Under the Curve (AUC) measure compared to the baseline feed-forward DNN.
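As a rough illustration of the max-pooling loss idea, the sketch below runs an LSTM over an utterance, takes the frame with the highest keyword posterior for keyword utterances, and back-propagates a cross-entropy loss through that single frame, while every frame of a background utterance is trained as non-keyword. It is a simplified PyTorch sketch with assumed shapes and layer sizes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KWSModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h)               # (batch, time, n_classes) logits

def max_pooling_loss(logits, is_keyword, keyword_class=1):
    """Keyword utterances: train only the best-scoring frame.
    Background utterances: train every frame as class 0."""
    ce = nn.CrossEntropyLoss()
    losses = []
    for b in range(logits.size(0)):
        frame_logits = logits[b]                              # (time, n_classes)
        if is_keyword[b]:
            probs = frame_logits.softmax(dim=-1)[:, keyword_class]
            t = int(probs.argmax())                           # max-pooled frame
            losses.append(ce(frame_logits[t:t + 1], torch.tensor([keyword_class])))
        else:
            target = torch.zeros(frame_logits.size(0), dtype=torch.long)
            losses.append(ce(frame_logits, target))
    return torch.stack(losses).mean()

# Toy usage: one keyword utterance and one background utterance of 50 frames.
model = KWSModel()
loss = max_pooling_loss(model(torch.randn(2, 50, 40)), is_keyword=[True, False])
loss.backward()
```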
{"title":"Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting","authors":"Ming Sun, A. Raju, G. Tucker, S. Panchapagesan, Gengshen Fu, Arindam Mandal, S. Matsoukas, N. Strom, S. Vitaladevuni","doi":"10.1109/SLT.2016.7846306","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846306","url":null,"abstract":"We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields 67:6% relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117109951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phonetic content impact on Forensic Voice Comparison
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846267
Ajili Moez, Bonastre Jean-François, Ben Kheder Waad, Rossato Solange, Kahn Juliette
Forensic Voice Comparison (FVC) increasingly uses the likelihood ratio (LR) to indicate whether the evidence supports the prosecution (same-speaker) or defense (different-speakers) hypothesis. In addition to supporting one hypothesis, the LR provides a theoretically founded estimate of the relative strength of that support. Despite this appealing theoretical aspect, the LR suffers from practical limitations due both to its estimation process itself and to a lack of knowledge about the reliability of that (practical) estimation process. In a large set of situations, a lack of reliability at the estimation level potentially destroys the reliability of the resulting LR. This is particularly true when automatic FVC is considered, as Automatic Speaker Recognition (ASpR) systems output a score in all situations regardless of the case-specific conditions. Furthermore, ASpR systems use different normalization steps to interpret their scores as LRs, and these normalization steps are potential sources of bias. In the LR estimation done by ASpR systems, several factors are not taken into account, such as the amount of information involved in the comparison, the phonemic content, and the speaker's intrinsic characteristics, denoted here the "speaker factor". Consequently, a more complete view of reliability seems mandatory for FVC, even if an LR-like approach is used. This article focuses on the impact of phonemic content on FVC performance and variability. The experimental part uses the FABIOLE database, which is dedicated to this kind of study and allows us to examine both inter-speaker and intra-speaker variability. The results demonstrate the importance of the phonemic content and highlight interesting differences between inter-speaker and intra-speaker effects.
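For context, the likelihood ratio referred to above compares how probable the observed evidence E is under the same-speaker (prosecution) hypothesis versus the different-speakers (defense) hypothesis; a standard formulation, not specific to this paper, is:

```latex
\mathrm{LR} = \frac{p(E \mid H_{\mathrm{same}})}{p(E \mid H_{\mathrm{diff}})}
```

Values above 1 support the same-speaker hypothesis, values below 1 support the different-speakers hypothesis, and the further the LR is from 1, the stronger the support.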
{"title":"Phonetic content impact on Forensic Voice Comparison","authors":"Ajili. Moez, Bonastre Jean-François, Ben Kheder Waad, Rossato Solange, Kahn Juliette","doi":"10.1109/SLT.2016.7846267","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846267","url":null,"abstract":"Forensic Voice Comparison (FVC) is increasingly using the likelihood ratio (LR) in order to indicate whether the evidence supports the prosecution (same-speaker) or defender (different-speakers) hypotheses. In addition to support one hypothesis, the LR provides a theoretically founded estimate of the relative strength of its support. Despite this nice theoretical aspect, the LR accepts some practical limitations due both to its estimation process itself and to a lack of knowledge about the reliability of this (practical) estimation process. In a large set of situations, a lack in reliability at the estimation process level potentially destroys the reliability of the resulting LR. It is particularly true when automatic FVC is considered, as Automatic Speaker Recognition (ASpR) systems are outputting a score in all situations regardless of the case specific conditions. Furthermore, ASpR systems use different normalization steps to see their scores as LR and these normalization steps are potential sources of bias. In the LR estimation done by ASpR systems, different factors are not taken into account such as the amount of information involved in the comparison, the phonemic content and finally the speaker intrinsic characteristics, denoted here ”speaker factor”. Consequently, a more complete view of reliability seems to be a mandatory point for FVC, even if a LR-like approach is used. This article focuses on the impact of phonemic content on FVC performance and variability. The experimental part is using FABIOLE database. This database is dedicated to this kind of studies and allows to examine both interspeaker variability and intra-speaker variability. The results demonstrate the importance of the phonemic content and highlight interesting differences between inter-speakers effects and intra-speaker’s ones.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126809741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural network based acoustic model parameter reduction using manifold regularized low rank matrix factorization
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846333
Hoon Chung, Jeom-ja Kang, Kiyoung Park, Sung Joo Lee, J. Park
In this paper, we propose a deep neural network (DNN) model parameter reduction method based on manifold regularized low-rank matrix factorization to reduce the computational complexity of acoustic models for low-resource embedded devices. One of the most common DNN model parameter reduction techniques is truncated singular value decomposition (TSVD). TSVD reduces the number of parameters by approximating a target matrix with a low-rank one that minimizes the Euclidean norm of the approximation error. In this work, we question whether the Euclidean norm is an appropriate objective function for factorizing DNN matrices, because DNNs are known to learn a nonlinear manifold of acoustic features. Therefore, in order to exploit the manifold structure for robust parameter reduction, we propose a manifold regularized matrix factorization approach. The proposed method was evaluated on the TIMIT phone recognition task.
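As background for the TSVD baseline mentioned above, the sketch below factorizes a weight matrix W into two thinner matrices with a truncated SVD, which is how the parameter count of a dense layer is typically reduced. It illustrates the baseline under assumed layer sizes, not the proposed manifold-regularized objective.

```python
import numpy as np

def truncated_svd_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A (m x rank) and B (rank x n).

    Replacing one m x n layer with two layers of sizes m x rank and rank x n
    reduces parameters from m*n to rank*(m + n) when rank is small.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

# Toy usage: compress a hypothetical 512 x 512 hidden layer to rank 64.
W = np.random.randn(512, 512)
A, B = truncated_svd_factorize(W, rank=64)
print(W.size, A.size + B.size)          # 262144 vs. 65536 parameters
print(np.linalg.norm(W - A @ B))        # Euclidean (Frobenius) approximation error
```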
{"title":"Deep neural network based acoustic model parameter reduction using manifold regularized low rank matrix factorization","authors":"Hoon Chung, Jeom-ja Kang, Kiyoung Park, Sung Joo Lee, J. Park","doi":"10.1109/SLT.2016.7846333","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846333","url":null,"abstract":"In this paper, we propose a deep neural network (DNN) model parameter reduction based on manifold regularized low rank matrix factorization to reduce the computational complexity of acoustic model for low resource embedded devices. One of the most common DNN model parameter reduction techniques is truncated singular value decomposition (TSVD). TSVD reduces the number of parameters by approximating a target matrix with a low rank one in terms of minimizing the Euclidean norm. In this work, we questioned whether the Euclidean norm is appropriate as objective function to factorize DNN matrices because DNN is known to learn nonlinear manifold of acoustic features. Therefore, in order to exploit the manifold structure for robust parameter reduction, we propose manifold regularized matrix factorization approach. The proposed method was evaluated on TIMIT phone recognition domain.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125374332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation and calibration of Lombard effects in speaker verification
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846266
Finnian Kelly, J. Hansen
The Lombard effect is the involuntary tendency of speakers to increase their vocal effort in noisy environments in order to maintain intelligible communication. This study assesses the impact of the Lombard effect on the performance of a current speaker verification system. Lombard speech produced in the presence of several noise types and noise levels is drawn from the UT-Scope corpus. The performance of an i-vector PLDA (Probabilistic Linear Discriminant Analysis) system is observed to degrade significantly with Lombard speech. The resulting error rates are found to be dependent on the noise type and noise level. A score calibration scheme based on Quality Measure Functions (QMFs) is adopted, allowing noise information to be incorporated into calibration. This approach leads to a reduction in discrimination error relative to conventional calibration.
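Quality-measure-function calibration typically augments a linear score-to-log-likelihood-ratio mapping with terms derived from per-trial quality measures such as the noise level. The sketch below shows that idea with a simple logistic-regression calibrator; the scores, the noise-level quality measure, and all values are invented for illustration and this is not the calibration recipe from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical development trials: raw verification scores, a per-trial quality
# measure (an assumed noise level in dB), and target/non-target labels.
rng = np.random.default_rng(0)
target_scores = rng.normal(1.0, 1.0, 500)       # same-speaker trials
nontarget_scores = rng.normal(-1.0, 1.0, 500)   # different-speaker trials
raw_scores = np.concatenate([target_scores, nontarget_scores])
noise_db = rng.uniform(40, 90, 1000)
labels = np.array([1] * 500 + [0] * 500)

# QMF-style calibration: the calibrated log-likelihood ratio is an affine
# function of the raw score and the quality measure(s).
X = np.column_stack([raw_scores, noise_db])
calibrator = LogisticRegression().fit(X, labels)

def calibrated_llr(score, noise_level):
    w, b = calibrator.coef_[0], calibrator.intercept_[0]
    return float(w[0] * score + w[1] * noise_level + b)   # log-odds used as LLR

print(calibrated_llr(0.5, 75.0))
```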
{"title":"Evaluation and calibration of Lombard effects in speaker verification","authors":"Finnian Kelly, J. Hansen","doi":"10.1109/SLT.2016.7846266","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846266","url":null,"abstract":"The Lombard effect is the involuntary tendency of speakers to increase their vocal effort in noisy environments in order to maintain intelligible communication. This study assesses the impact of the Lombard effect on the performance of a current speaker verification system. Lombard speech produced in the presence of several noise types and noise levels is drawn from the UT-Scope corpus. The performance of an i-vector PLDA (Probabilistic Linear Discriminant Analysis) system is observed to degrade significantly with Lombard speech. The resulting error rates are found to be dependent on the noise type and noise level. A score calibration scheme based on Quality Measure Functions (QMFs) is adopted, allowing noise information to be incorporated into calibration. This approach leads to a reduction in discrimination error relative to conventional calibration.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127250869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An overview of end-to-end language understanding and dialog management for personal digital assistants
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846294
R. Sarikaya, Paul A. Crook, Alex Marin, Minwoo Jeong, J. Robichaud, Asli Celikyilmaz, Young-Bum Kim, Alexandre Rochette, O. Khan, Xiaohu Liu, D. Boies, T. Anastasakos, Zhaleh Feizollahi, Nikhil Ramesh, H. Suzuki, R. Holenstein, E. Krawczyk, Vasiliy Radostev
Spoken language understanding and dialog management have emerged as key technologies in interacting with personal digital assistants (PDAs). The coverage, complexity, and scale of PDAs are much larger than those of previous conversational understanding systems. As such, new problems arise. In this paper, we provide an overview of the language understanding and dialog management capabilities of PDAs, focusing particularly on Cortana, Microsoft's PDA. We explain the system architecture for language understanding and dialog management for our PDA, indicate how it differs from prior state-of-the-art systems, and describe key components. We also report a set of experiments detailing system performance on a variety of scenarios and tasks. We describe how the quality of user experiences is measured end-to-end and also discuss open issues.
{"title":"An overview of end-to-end language understanding and dialog management for personal digital assistants","authors":"R. Sarikaya, Paul A. Crook, Alex Marin, Minwoo Jeong, J. Robichaud, Asli Celikyilmaz, Young-Bum Kim, Alexandre Rochette, O. Khan, Xiaohu Liu, D. Boies, T. Anastasakos, Zhaleh Feizollahi, Nikhil Ramesh, H. Suzuki, R. Holenstein, E. Krawczyk, Vasiliy Radostev","doi":"10.1109/SLT.2016.7846294","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846294","url":null,"abstract":"Spoken language understanding and dialog management have emerged as key technologies in interacting with personal digital assistants (PDAs). The coverage, complexity, and the scale of PDAs are much larger than previous conversational understanding systems. As such, new problems arise. In this paper, we provide an overview of the language understanding and dialog management capabilities of PDAs, focusing particularly on Cortana, Microsoft's PDA. We explain the system architecture for language understanding and dialog management for our PDA, indicate how it differs with prior state-of-the-art systems, and describe key components. We also report a set of experiments detailing system performance on a variety of scenarios and tasks. We describe how the quality of user experiences are measured end-to-end and also discuss open issues.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127252676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech vs. text: A comparative analysis of features for depression detection systems
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846256
M. Morales, Rivka Levitan
Depression is a serious illness that affects millions of people globally. In recent years, the task of automatic depression detection from speech has gained popularity. However, several challenges remain, including which features provide the best discrimination between classes or depression levels. Thus far, most research has focused on extracting features from the speech signal. However, the speech production system is complex, and depression has been shown to affect many linguistic properties, including phonetics, semantics, and syntax. Therefore, we argue that researchers should look beyond the acoustic properties of speech by building features that capture syntactic structure and semantic content. We provide a comparative analysis of various features for depression detection. Using the same corpus, we evaluate how a system built on text-based features compares to a speech-based system. We find that a combination of features drawn from both speech and text leads to the best system performance.
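The comparison above rests on building separate acoustic and text feature sets and then combining them. A minimal early-fusion sketch, with placeholder feature extractors, toy data, and an SVM classifier standing in for whichever model is used, looks roughly like this:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def acoustic_features(signal):
    """Placeholder: e.g. energy and amplitude statistics per recording."""
    return np.array([signal.mean(), signal.std(), np.abs(signal).max()])

def text_features(transcript):
    """Placeholder: e.g. lexical statistics per transcript."""
    words = transcript.split()
    return np.array([len(words), np.mean([len(w) for w in words])])

# Early fusion: concatenate speech-based and text-based features per subject.
signals = [np.random.randn(16000) for _ in range(20)]          # toy audio
transcripts = ["i feel tired most days"] * 10 + ["today was a good day"] * 10
labels = [1] * 10 + [0] * 10                                    # toy labels

X = np.array([np.concatenate([acoustic_features(s), text_features(t)])
              for s, t in zip(signals, transcripts)])
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X, labels)
```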
{"title":"Speech vs. text: A comparative analysis of features for depression detection systems","authors":"M. Morales, Rivka Levitan","doi":"10.1109/SLT.2016.7846256","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846256","url":null,"abstract":"Depression is a serious illness that affects millions of people globally. In recent years, the task of automatic depression detection from speech has gained popularity. However, several challenges remain, including which features provide the best discrimination between classes or depression levels. Thus far, most research has focused on extracting features from the speech signal. However, the speech production system is complex and depression has been shown to affect many linguistic properties, including phonetics, semantics, and syntax. Therefore, we argue that researchers should look beyond the acoustic properties of speech by building features that capture syntactic structure and semantic content. We provide a comparative analyses of various features for depression detection. Using the same corpus, we evaluate how a system built on text-based features compares to a speech-based system. We find that a combination of features drawn from both speech and text lead to the best system performance.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134125590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extractive speech summarization leveraging convolutional neural network techniques
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846259
Chun-I Tsai, Hsiao-Tsung Hung, Kuan-Yu Chen, Berlin Chen
Extractive text or speech summarization endeavors to select representative sentences from a source document and assemble them into a concise summary, so as to help people browse and assimilate the main theme of the document efficiently. The recent past has seen a surge of interest in developing deep learning- or deep neural network-based supervised methods for extractive text summarization. This paper presents a continuation of this line of research for speech summarization, and its contributions are three-fold. First, we exploit an effective framework that integrates two convolutional neural networks (CNNs) and a multilayer perceptron (MLP) for summary sentence selection. Specifically, the CNNs encode a given document-sentence pair into two discriminative vector embeddings separately, while the MLP in turn takes the two embeddings of a document-sentence pair and their similarity measure as input to induce a ranking score for each sentence. Second, the input of the MLP is augmented by a rich set of prosodic and lexical features apart from those derived from the CNNs. Third, the utility of our proposed summarization methods and several widely used methods is extensively analyzed and compared. The empirical results seem to demonstrate the effectiveness of our summarization method in relation to several state-of-the-art methods.
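To make the architecture concrete, the sketch below encodes the document and a candidate sentence with separate 1-D convolutional encoders, concatenates the two embeddings with their cosine similarity, and scores the sentence with an MLP. The layer sizes and dimensions are illustrative assumptions, and the extra prosodic/lexical features mentioned above are omitted; this is not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNEncoder(nn.Module):
    """Encode a sequence of word embeddings into a fixed-size vector."""
    def __init__(self, emb_dim=100, n_filters=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=kernel, padding=1)

    def forward(self, x):                         # x: (batch, seq_len, emb_dim)
        h = F.relu(self.conv(x.transpose(1, 2)))  # (batch, n_filters, seq_len)
        return h.max(dim=2).values                # max-over-time pooling

class SentenceRanker(nn.Module):
    def __init__(self, emb_dim=100, n_filters=64):
        super().__init__()
        self.doc_enc = CNNEncoder(emb_dim, n_filters)
        self.sent_enc = CNNEncoder(emb_dim, n_filters)
        self.mlp = nn.Sequential(nn.Linear(2 * n_filters + 1, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, doc, sent):
        d, s = self.doc_enc(doc), self.sent_enc(sent)
        sim = F.cosine_similarity(d, s, dim=1).unsqueeze(1)     # similarity measure
        return self.mlp(torch.cat([d, s, sim], dim=1)).squeeze(1)  # ranking score

# Toy usage: a 200-word document and a 20-word candidate sentence.
model = SentenceRanker()
score = model(torch.randn(1, 200, 100), torch.randn(1, 20, 100))
```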
{"title":"Extractive speech summarization leveraging convolutional neural network techniques","authors":"Chun-I Tsai, Hsiao-Tsung Hung, Kuan-Yu Chen, Berlin Chen","doi":"10.1109/SLT.2016.7846259","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846259","url":null,"abstract":"Extractive text or speech summarization endeavors to select representative sentences from a source document and assemble them into a concise summary, so as to help people to browse and assimilate the main theme of the document efficiently. The recent past has seen a surge of interest in developing deep learning- or deep neural network-based supervised methods for extractive text summarization. This paper presents a continuation of this line of research for speech summarization and its contributions are three-fold. First, we exploit an effective framework that integrates two convolutional neural networks (CNNs) and a multilayer perceptron (MLP) for summary sentence selection. Specifically, CNNs encode a given document-sentence pair into two discriminative vector embeddings separately, while MLP in turn takes the two embeddings of a document-sentence pair and their similarity measure as the input to induce a ranking score for each sentence. Second, the input of MLP is augmented by a rich set of prosodic and lexical features apart from those derived from CNNs. Third, the utility of our proposed summarization methods and several widely-used methods are extensively analyzed and compared. The empirical results seem to demonstrate the effectiveness of our summarization method in relation to several state-of-the-art methods.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"71 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120870154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis of user behavior with multimodal virtual customer service agents
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846322
Ian Beaver, Cynthia Freeman
We investigate the occurrence of user restatement when there is no apparent error in Intelligent Virtual Assistant (IVA) understanding in a multimodal customer service environment. We define several classes of response medium combinations and use various statistical measures to determine how the combination of medium and linguistic complexity impacts the user's apparent willingness to accept their query result. Through analysis of 3,000 sessions with a live customer service IVA deployed on an airline company website and mobile application, we discover that as more media are involved in a response, user restatements increase. We also determine which linguistic complexity measures should be minimized for every response class in order to improve user comprehension.
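One simple way to test whether restatement rates differ across response-medium classes, the kind of statistical comparison this analysis relies on, is a chi-square test over a contingency table of restated versus accepted responses per class. The counts below are invented for illustration and do not come from the paper.

```python
from scipy.stats import chi2_contingency

# Rows: response-medium classes (e.g. text only, text+link, text+image+link).
# Columns: [sessions with a restatement, sessions without]. Counts are made up.
table = [[120, 880],
         [210, 790],
         [340, 660]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```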
{"title":"Analysis of user behavior with multimodal virtual customer service agents","authors":"Ian Beaver, Cynthia Freeman","doi":"10.1109/SLT.2016.7846322","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846322","url":null,"abstract":"We investigate the occurrence of user restatement when there is no apparent error in Intelligent Virtual Assistant (IVA) understanding in a multimodal customer service environment. We define several classes of response medium combinations and use various statistical measures to determine how the combination of medium and linguistic complexity impacts the user's apparent willingness to accept their query result. Through analysis on 3; 000 sessions with a live customer service IVA deployed on an airline company website and mobile application, we discover that as more media are involved in a response, user restatements increase. We also determine which linguistic complexity measures should be minimized for every response class in order to improve user comprehension.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123215866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}