Towards a virtual personal assistant based on a user-defined portfolio of multi-domain vocal applications
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846252
Tatiana Ekeinhor-Komi, J. Bouraoui, R. Laroche, F. Lefèvre
This paper proposes a novel approach to defining and simulating a new generation of virtual personal assistants as multi-application, multi-domain distributed dialogue systems. The first contribution is the assistant architecture, composed of independent third-party applications handled by a Dispatcher. In this view, applications are black boxes that respond to user requests with self-scored answers. The Dispatcher then routes the current request to the most relevant application, based on these scores and the context (interaction history, etc.), and conveys its answer to the user. To address variations in the user-defined portfolio of applications, the second contribution, a stochastic model, automates the online optimisation of the Dispatcher's behaviour. To evaluate the learnability of the Dispatcher's policy, several parametrisations of the user and application simulators are enabled so that they cover a range of realistic situations. The results confirm that, in all considered configurations of interest, reinforcement learning can learn adapted strategies.
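As a rough illustration of the dispatching principle described in this abstract (not the authors' implementation), the sketch below routes a request to whichever application returns the highest self-score, lightly adjusted by interaction history; the `Application` and `Dispatcher` classes and their scoring rules are invented placeholders.

```python
# Minimal sketch of score-and-context-based dispatching; class and attribute
# names are illustrative, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Application:
    name: str
    def answer(self, request: str) -> tuple[str, float]:
        # A real application would return a domain answer and a self-assessed
        # confidence score; here we fake the score from keyword overlap.
        overlap = len(set(request.lower().split()) & set(self.name.lower().split()))
        return f"[{self.name}] reply to '{request}'", float(overlap)

@dataclass
class Dispatcher:
    applications: list[Application]
    history: list[str] = field(default_factory=list)

    def dispatch(self, request: str) -> str:
        # Combine each application's self-score with a simple context bonus
        # (recently used applications are slightly preferred).
        scored = []
        for app in self.applications:
            answer, score = app.answer(request)
            context_bonus = 0.1 if app.name in self.history[-3:] else 0.0
            scored.append((score + context_bonus, answer, app.name))
        best_score, best_answer, best_app = max(scored)
        self.history.append(best_app)
        return best_answer

dispatcher = Dispatcher([Application("weather forecast"), Application("music player")])
print(dispatcher.dispatch("what is the weather forecast for tomorrow"))
```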
{"title":"Towards a virtual personal assistant based on a user-defined portfolio of multi-domain vocal applications","authors":"Tatiana Ekeinhor-Komi, J. Bouraoui, R. Laroche, F. Lefèvre","doi":"10.1109/SLT.2016.7846252","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846252","url":null,"abstract":"This paper proposes a novel approach to defining and simulating a new generation of virtual personal assistants as multi-application multi-domain distributed dialogue systems. The first contribution is the assistant architecture, composed of independent third-party applications handled by a Dispatcher. In this view, applications are black-boxes responding with a self-scored answer to user requests. Next, the Dispatcher distributes the current request to the most relevant application, based on these scores and the context (history of interaction etc.), and conveys its answer to the user. To address variations in the user-defined portfolio of applications, the second contribution, a stochastic model automates the online optimisation of the Dispatcher's behaviour. To evaluate the learnability of the Dispatcher's policy, several parametrisations of the user and application simulators are enabled, in such a way that they cover variations of realistic situations. Results confirm in all considered configurations of interest, that reinforcement learning can learn adapted strategies.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127354192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Code-switching detection using multilingual DNNs
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846326
Emre Yilmaz, H. V. D. Heuvel, D. V. Leeuwen
Automatic speech recognition (ASR) of code-switching speech requires careful handling of unexpected language switches that may occur within a single utterance. In this paper, we investigate the feasibility of using multilingually trained deep neural networks (DNNs) for the ASR of Frisian speech containing code-switches to Dutch, with the aim of building a robust recognizer that can handle this phenomenon. For this purpose, we train several multilingual DNN models on Frisian and two closely related languages, namely English and Dutch, to compare the impact of single-step and two-step multilingual DNN training on recognition and code-switching detection performance. We apply bilingual DNN retraining on both target languages, varying the amount of training data belonging to the higher-resourced target language (Dutch). The recognition results show that the multilingual DNN training scheme with an initial multilingual training step followed by bilingual retraining provides recognition performance comparable to an oracle baseline recognizer that can employ language-specific acoustic models. We further show that we can detect code-switches at the word level with an equal error rate of around 17%, excluding deletions due to ASR errors.
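For readers unfamiliar with the metric quoted above, the toy sketch below computes a word-level equal error rate from per-word switch scores; the scores and labels are synthetic, and only the metric itself reflects the paper.

```python
# Illustrative word-level EER computation for switch detection; the data below
# are random stand-ins, not the paper's Frisian-Dutch material.
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 = word is a code-switch point, 0 = no switch."""
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        decisions = scores >= t
        far = np.mean(decisions[labels == 0])   # false alarm rate
        frr = np.mean(~decisions[labels == 1])  # miss rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels * rng.normal(1.0, 1.0, 1000) + (1 - labels) * rng.normal(-1.0, 1.0, 1000)
print(f"EER ~ {equal_error_rate(scores, labels):.2%}")
```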
{"title":"Code-switching detection using multilingual DNNS","authors":"Emre Yilmaz, H. V. D. Heuvel, D. V. Leeuwen","doi":"10.1109/SLT.2016.7846326","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846326","url":null,"abstract":"Automatic speech recognition (ASR) of code-switching speech requires careful handling of unexpected language switches that may occur in a single utterance. In this paper, we investigate the feasibility of using multilingually trained deep neural networks (DNN) for the ASR of Frisian speech containing code-switches to Dutch with the aim of building a robust recognizer that can handle this phenomenon. For this purpose, we train several multilingual DNN models on Frisian and two closely related languages, namely English and Dutch, to compare the impact of single-step and two-step multilingual DNN training on the recognition and code-switching detection performance. We apply bilingual DNN retraining on both target languages by varying the amount of training data belonging to the higher-resourced target language (Dutch). The recognition results show that the multilingual DNN training scheme with an initial multilingual training step followed by bilingual retraining provides recognition performance comparable to an oracle baseline recognizer that can employ language-specific acoustic models. We further show that we can detect code-switches at the word level with an equal error rate of around 17% excluding the deletions due to ASR errors.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125114933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dialog state tracking with attention-based sequence-to-sequence learning
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846317
Takaaki Hori, Hai Wang, Chiori Hori, Shinji Watanabe, B. Harsham, Jonathan Le Roux, J. Hershey, Yusuke Koji, Yi Jing, Zhaocheng Zhu, T. Aikawa
We present an advanced dialog state tracking system designed for the 5th Dialog State Tracking Challenge (DSTC5). The main task of DSTC5 is to track the dialog state in a human-human dialog. For each utterance, the tracker emits a frame of slot-value pairs considering the full history of the dialog up to the current turn. Our system includes an encoder-decoder architecture with an attention mechanism that maps an input word sequence to a set of semantic labels, i.e., slot-value pairs, which handles the problem of the unknown alignment between utterances and labels. By combining the attention-based tracker with rule-based trackers developed for English and Chinese, the F-score on the development set improved from 0.475 to 0.507 compared to the rule-only trackers. Moreover, we achieved an F-score of 0.517 by refining the combination strategy based on the topic- and slot-level performance of each tracker. In this paper, we also validate the efficacy of each technique and report the test set results submitted to the challenge.
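The per-topic, per-slot combination strategy mentioned above could look roughly like the sketch below, which keeps, for each (topic, slot) pair, the output of whichever tracker scored best on development data; the tracker names, slots, and F-scores are hypothetical placeholders, not the DSTC5 system's values.

```python
# Toy per-(topic, slot) tracker combination: pick each value from the tracker
# with the highest development F-score for that pair (invented numbers).
from typing import Dict, Tuple

dev_fscore: Dict[str, Dict[Tuple[str, str], float]] = {
    "attention": {("FOOD", "CUISINE"): 0.55, ("ACCOMMODATION", "TYPE"): 0.40},
    "rule":      {("FOOD", "CUISINE"): 0.48, ("ACCOMMODATION", "TYPE"): 0.52},
}

def combine(frames: Dict[str, Dict[Tuple[str, str], str]]) -> Dict[Tuple[str, str], str]:
    """frames: tracker name -> {(topic, slot): value} for the current turn."""
    combined = {}
    slots = {key for frame in frames.values() for key in frame}
    for key in slots:
        # Among trackers that produced a value, trust the one with the best dev score.
        candidates = [t for t in frames if key in frames[t]]
        best = max(candidates, key=lambda t: dev_fscore[t].get(key, 0.0))
        combined[key] = frames[best][key]
    return combined

turn_outputs = {
    "attention": {("FOOD", "CUISINE"): "seafood"},
    "rule":      {("FOOD", "CUISINE"): "chinese", ("ACCOMMODATION", "TYPE"): "hostel"},
}
print(combine(turn_outputs))
```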
{"title":"Dialog state tracking with attention-based sequence-to-sequence learning","authors":"Takaaki Hori, Hai Wang, Chiori Hori, Shinji Watanabe, B. Harsham, Jonathan Le Roux, J. Hershey, Yusuke Koji, Yi Jing, Zhaocheng Zhu, T. Aikawa","doi":"10.1109/SLT.2016.7846317","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846317","url":null,"abstract":"We present an advanced dialog state tracking system designed for the 5th Dialog State Tracking Challenge (DSTC5). The main task of DSTC5 is to track the dialog state in a human-human dialog. For each utterance, the tracker emits a frame of slot-value pairs considering the full history of the dialog up to the current turn. Our system includes an encoder-decoder architecture with an attention mechanism to map an input word sequence to a set of semantic labels, i.e., slot-value pairs. This handles the problem of the unknown alignment between the utterances and the labels. By combining the attention-based tracker with rule-based trackers elaborated for English and Chinese, the F-score for the development set improved from 0.475 to 0.507 compared to the rule-only trackers. Moreover, we achieved 0.517 F-score by refining the combination strategy based on the topic and slot level performance of each tracker. In this paper, we also validate the efficacy of each technique and report the test set results submitted to the challenge.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130842951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards acoustic model unification across dialects
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846328
Mohamed G. Elfeky, M. Bastani, Xavier Velez, P. Moreno, Austin Waters
Acoustic model performance typically decreases when the model is evaluated on a dialectal variation of the same language that was not seen during training. Similarly, models trained simultaneously on a group of dialects tend to underperform dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and Multitask Learning (MTL). In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge into a single model. In MTL, we train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the jointly trained model trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively. While achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3.4%.
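A minimal sketch of the distillation idea follows, assuming the soft targets are the averaged, temperature-softened posteriors of the dialect-specific teachers; the shapes, temperature, and loss form are illustrative choices, not the paper's recipe.

```python
# Ensemble-to-single-model distillation sketch with synthetic logits.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_targets(teacher_logits: list[np.ndarray], temperature: float = 2.0) -> np.ndarray:
    # Average the softened posteriors of the dialect-specific teachers.
    return np.mean([softmax(t, temperature) for t in teacher_logits], axis=0)

def distillation_loss(student_logits: np.ndarray, soft_targets: np.ndarray,
                      temperature: float = 2.0) -> float:
    # Cross-entropy between the soft targets and the student's softened posteriors.
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(soft_targets * log_q).sum(axis=-1).mean())

rng = np.random.default_rng(0)
frames, senones = 8, 50                       # toy sizes, not the paper's
teachers = [rng.normal(size=(frames, senones)) for _ in range(3)]
student = rng.normal(size=(frames, senones))
targets = distillation_targets(teachers)
print(f"distillation loss: {distillation_loss(student, targets):.3f}")
```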
{"title":"Towards acoustic model unification across dialects","authors":"Mohamed G. Elfeky, M. Bastani, Xavier Velez, P. Moreno, Austin Waters","doi":"10.1109/SLT.2016.7846328","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846328","url":null,"abstract":"Acoustic model performance typically decreases when evaluated on a dialectal variation of the same language that was not used during training. Similarly, models simultaneously trained on a group of dialects tend to underperform dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and MultiTask Learning (MTL). In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge in a single model. In MTL, we utilize multitask learning to train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the jointly-trained model that is trained on all dialectal data, reducing word error rates by 4:2% and 0:6%, respectively. While achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3:4%.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128415902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind speech segmentation using spectrogram image-based features and Mel cepstral coefficients
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846324
Adriana Stan, Cassia Valentini-Botinhao, B. Orza, M. Giurgiu
This paper introduces a novel method for blind speech segmentation at the phone level based on image processing. We consider the spectrogram of an utterance's waveform as an image and hypothesize that its striping defects, i.e. discontinuities, appear due to phone boundaries. These discontinuities are found using a simple image destriping algorithm. To discover phone transitions that are less salient in the image, we compute spectral changes derived from the time evolution of a Mel cepstral parametrisation of the speech. These so-called image-based and acoustic features are then combined into a mixed probability function whose values indicate the likelihood of a phone boundary being located at the corresponding time frame. The method is completely unsupervised and achieves an accuracy of 75.59% at a −3.26% over-segmentation rate, yielding an F-measure of 0.76 and an R-value of 0.80 on the TIMIT dataset.
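The sketch below approximates the two feature streams described above, column-wise spectrogram discontinuities and Mel-cepstral change, and mixes them into a single boundary-likelihood curve; the window sizes, the crude cepstral computation, the mixing weight, and the peak-picking threshold are assumptions for illustration, not the paper's settings.

```python
# Boundary-likelihood sketch combining spectrogram discontinuities and
# cepstral change on a synthetic 1 s signal.
import numpy as np
from scipy.signal import find_peaks, spectrogram

def boundary_likelihood(wave: np.ndarray, sr: int = 16000, alpha: float = 0.5) -> np.ndarray:
    # Image-like stream: log-magnitude spectrogram and its column-wise discontinuity.
    _, _, spec = spectrogram(wave, fs=sr, nperseg=400, noverlap=240)
    log_spec = np.log(spec + 1e-10)
    image_change = np.abs(np.diff(log_spec, axis=1)).mean(axis=0)

    # Acoustic stream: a crude cepstral parametrisation (inverse FFT of the
    # log spectrum) and its frame-to-frame change.
    cepstra = np.fft.irfft(log_spec, axis=0)[:13]
    cepstral_change = np.linalg.norm(np.diff(cepstra, axis=1), axis=0)

    # Normalise both streams and mix them.
    norm = lambda x: (x - x.min()) / (np.ptp(x) + 1e-10)
    return alpha * norm(image_change) + (1 - alpha) * norm(cepstral_change)

rng = np.random.default_rng(0)
wave = rng.normal(size=16000)                 # 1 s of noise as a stand-in signal
likelihood = boundary_likelihood(wave)
boundaries, _ = find_peaks(likelihood, height=0.5, distance=5)
print("candidate boundary frames:", boundaries[:10])
```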
{"title":"Blind speech segmentation using spectrogram image-based features and Mel cepstral coefficients","authors":"Adriana Stan, Cassia Valentini-Botinhao, B. Orza, M. Giurgiu","doi":"10.1109/SLT.2016.7846324","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846324","url":null,"abstract":"This paper introduces a novel method for blind speech segmentation at a phone level based on image processing. We consider the spectrogram of the waveform of an utterance as an image and hypothesize that its striping defects, i.e. discontinuities, appear due to phone boundaries. Using a simple image destriping algorithm these discontinuities are found. To discover phone transitions which are not as salient in the image, we compute spectral changes derived from the time evolution of Mel cepstral parametrisation of speech. These socalled image-based and acoustic features are then combined to form a mixed probability function, whose values indicate the likelihood of a phone boundary being located at the corresponding time frame. The method is completely unsupervised and achieves an accuracy of 75.59% at a −3.26% over-segmentation rate, yielding an F-measure of 0.76 and an 0.80 R-value on the TIMIT dataset.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128662374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The NDSC transcription system for the 2016 multi-genre broadcast challenge
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846276
Xukui Yang, Dan Qu, Wenlin Zhang, Weiqiang Zhang
The National Digital Switching System Engineering and Technological R&D Center (NDSC) speech-to-text transcription system for the 2016 multi-genre broadcast challenge is described. Various acoustic models based on deep neural networks (DNNs) are trained, including hybrid DNNs, long short-term memory recurrent neural networks (LSTM RNNs), and time-delay neural networks (TDNNs). The system also makes use of recurrent neural network language models (RNNLMs) for re-scoring and minimum Bayes risk (MBR) combination. The WER on the test set of the speech-to-text task is 18.2%. Furthermore, to simulate real applications where manual segmentations are not available, an automatic segmentation system based on long-term information is proposed. WERs based on the automatically generated segments are slightly worse than those based on the manual segmentations.
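As a generic illustration of the RNNLM re-scoring step mentioned above (not the NDSC system itself), the toy sketch below re-ranks an n-best list with an interpolated language model score; the `rnnlm_logprob` stand-in, the weights, and all scores are invented.

```python
# Toy n-best rescoring with an interpolated n-gram/RNNLM score.
from typing import Callable, List, Tuple

def rescore_nbest(nbest: List[Tuple[str, float, float]],
                  rnnlm_logprob: Callable[[str], float],
                  lm_weight: float = 0.5, lm_scale: float = 10.0) -> str:
    """nbest: list of (hypothesis, acoustic_logprob, ngram_lm_logprob)."""
    def total(hyp: Tuple[str, float, float]) -> float:
        text, am, ngram = hyp
        # Interpolate the original n-gram LM score with the RNNLM score.
        lm = (1 - lm_weight) * ngram + lm_weight * rnnlm_logprob(text)
        return am + lm_scale * lm
    return max(nbest, key=total)[0]

# Stand-in RNNLM: favours shorter hypotheses (purely for demonstration).
fake_rnnlm = lambda text: -0.5 * len(text.split())

nbest = [("the cat sat on the mat", -120.0, -14.0),
         ("the cat sat on a mat", -121.0, -13.5)]
print(rescore_nbest(nbest, fake_rnnlm))
```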
{"title":"The NDSC transcription system for the 2016 multi-genre broadcast challenge","authors":"Xukui Yang, Dan Qu, Wenlin Zhang, Weiqiang Zhang","doi":"10.1109/SLT.2016.7846276","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846276","url":null,"abstract":"The National Digital Switching System Engineering and Technological R&D Center (NDSC) speech-to-text transcription system for the 2016 multi-genre broadcast challenge is described. Various acoustic models based on deep neural network (DNN), such as hybrid DNN, long short term memory recurrent neural network (LSTM RNN), and time delay neural network (TDNN), are trained. The system also makes use of recurrent neural network language models (RNNLMs) for re-scoring and minimum Bayes risk (MBR) combination. The WER on test dataset of the speech-to-text task is 18.2%. Furthermore, to simulate real applications where manual segmentations were not available an automatic segmentation system based on long-term information is proposed. WERs based on the automatically generated segments were slightly worse than that based on the manual segmentations.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116588429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pre-filtered dynamic time warping for posteriorgram based keyword search
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846292
Gozde Cetinkaya, Batuhan Gündogdu, M. Saraçlar
In this study, we present a pre-filtering method for dynamic time warping (DTW) to improve the efficiency of a posteriorgram-based keyword search (KWS) system. The ultimate aim is to improve the performance of a large vocabulary continuous speech recognition (LVCSR) based KWS system using the posteriorgram-based KWS approach. We use phonetic posteriorgrams to represent the audio data and generate average posteriorgrams to represent the given text queries. The DTW algorithm is used to determine the optimal alignment between the posteriorgrams of the audio data and the queries. Since DTW has quadratic complexity, it can be relatively inefficient for keyword search. Our main contribution is to reduce this complexity by pre-filtering based on a vector space representation of the two posteriorgrams, without any degradation in performance. Experimental results show that our system reduces the complexity and, when combined with the baseline LVCSR-based KWS system, improves performance for both out-of-vocabulary (OOV) and in-vocabulary (IV) queries.
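A compact sketch of the pre-filtering idea is given below, assuming a cheap cosine similarity between average posteriorgrams gates the quadratic-cost DTW alignment; the threshold, posterior dimensionality, and toy data are illustrative only, not the paper's configuration.

```python
# Pre-filtered DTW keyword search over synthetic phonetic posteriorgrams.
import numpy as np

def average_posteriorgram(post: np.ndarray) -> np.ndarray:
    return post.mean(axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def dtw_cost(query: np.ndarray, segment: np.ndarray) -> float:
    # Plain DTW over frame-level distances (1 - dot product of posteriors).
    n, m = len(query), len(segment)
    dist = 1.0 - query @ segment.T
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m] / (n + m))

def search(query: np.ndarray, segments: list, prefilter_threshold: float = 0.95):
    q_avg = average_posteriorgram(query)
    hits = []
    for idx, seg in enumerate(segments):
        # Cheap vector-space pre-filter; only promising segments reach DTW.
        if cosine(q_avg, average_posteriorgram(seg)) < prefilter_threshold:
            continue
        hits.append((idx, dtw_cost(query, seg)))
    return sorted(hits, key=lambda h: h[1])

rng = np.random.default_rng(0)
rand_post = lambda frames: rng.dirichlet(np.ones(40), size=frames)   # 40 "phones"
query = rand_post(20)
segments = [rand_post(60) for _ in range(5)] + [np.vstack([rand_post(10), query, rand_post(10)])]
print(search(query, segments))   # the last segment, which contains the query, should rank first
```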
{"title":"Pre-filtered dynamic time warping for posteriorgram based keyword search","authors":"Gozde Cetinkaya, Batuhan Gündogdu, M. Saraçlar","doi":"10.1109/SLT.2016.7846292","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846292","url":null,"abstract":"In this study, we present a pre-filtering method for dynamic time warping (DTW) to improve the efficiency of a posteriorgram based keyword search (KWS) system. The ultimate aim is to improve the performance of a large vocabulary continuous speech recognition (LVCSR) based KWS system using the posteriorgram based KWS approach. We use phonetic posteriorgrams to represent the audio data and generate average posteriorgrams to represent the given text queries. The DTW algorithm is used to determine the optimal alignment between the posteriorgrams of the audio data and the queries. Since DTW has quadratic complexity, it can be relatively inefficient for keyword search. Our main contribution is to reduce this complexity by pre-filtering based on a vector space representation of the two posteriorgrams without any degradation in performance. Experimental results show that our system reduces the complexity and when combined with the baseline LVCSR based KWS system, it improves the performance both for the out-of-vocabulary (OOV) queries and the in-vocabulary (IV) queries.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"389 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121004692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A factor analysis model of sequences for language recognition
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846287
M. Omar
The application of joint factor analysis [1] to speaker and language recognition has advanced the performance of automatic systems in these areas. A special case of the early work in [1], namely the i-vector representation [2], has been applied successfully in many areas including speaker [2], language [3], and speech recognition [4]. This work presents a novel model which represents a long sequence of observations using the factor analysis model of shorter overlapping subsequences. This model takes into consideration the dependency between adjacent latent vectors. It is shown that this model outperforms the current joint factor analysis approach, which assumes independent and identically distributed (iid) observations given one global latent vector. In addition, we replace the language-independent prior model of the latent vector in the i-vector model with a language-dependent prior model, and modify the objective function used in the estimation of the factor analysis projection matrix and the prior model to correspond to the cross-entropy objective function estimated under this new model. We also derive the update equations of the projection matrix and the prior model parameters which maximize the cross-entropy objective function. We evaluate the performance of our approach on the language recognition task of the robust automatic transcription of speech (RATS) project. Our experiments show improvements of up to 11% relative in terms of equal error rate, compared to the standard approach of using an i-vector representation [2].
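For reference, the underlying i-vector factor analysis model and the language-dependent prior substitution described above can be written as follows; the notation is the commonly used one for [2] and may differ from the paper's own symbols.

```latex
% Standard i-vector factor analysis model: the utterance-dependent GMM mean
% supervector M is a low-rank offset of the UBM mean supervector m, with a
% language-independent standard-normal prior on the latent vector w.
M = m + T\,w, \qquad w \sim \mathcal{N}(0, I)

% The paper replaces this with a language-dependent prior for language \ell
% (parameters assumed Gaussian here for illustration):
w \mid \ell \sim \mathcal{N}(\mu_{\ell}, \Sigma_{\ell})
```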
{"title":"A factor analysis model of sequences for language recognition","authors":"M. Omar","doi":"10.1109/SLT.2016.7846287","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846287","url":null,"abstract":"Joint factor analysis [1] application to speaker and language recognition advanced the performance of automatic systems in these areas. A special case of the early work in [1], namely the i-vector representation [2], has been applied successfully in many areas including speaker [2], language [3], and speech recognition [4]. This work presents a novel model which represents a long sequence of observations using the factor analysis model of shorter overlapping subsquences. This model takes into consideration the dependency of the adjacent latent vectors. It is shown that this model outperforms the current joint factor analysis approach based on the assumption of independent and identically distributed (iid) observations given one global latent vector. In addition, we replace the language-independent prior model of the latent vector in the i-vector model with a language-dependent prior model and modify the objective function used in the estimation of the factor analysis projection matrix and the prior model to correspond to the cross-entropy objective function estimated based on this new model. We derive also the update equations of the projection matrix and the prior model parameters which maximize the cross-entropy objective function. We evaluate the performance of our approach on the language recognition task of the robust automatic transcription of speech (RATS) project. Our experiments show improvements up to 11% relative using the proposed approach in terms of equal error rate compared to the standard approach of using an i-vector representation [2].","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114213755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development of the MIT ASR system for the 2016 Arabic Multi-genre Broadcast Challenge
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846280
T. A. Hanai, Wei-Ning Hsu, James R. Glass
The Arabic language, with over 300 million speakers, has significant diversity and breadth, which proves challenging when building an automated system to understand what is said. This paper describes an Arabic Automatic Speech Recognition system developed on a 1,200-hour speech corpus made available for the 2016 Arabic Multi-genre Broadcast (MGB) Challenge. A range of Deep Neural Network (DNN) topologies were modeled, including Feed-forward, Convolutional, Time-Delay, Recurrent Long Short-Term Memory (LSTM), Highway LSTM (H-LSTM), and Grid LSTM (G-LSTM) networks. The best performance came from a sequence-discriminatively trained G-LSTM neural network. The best overall Word Error Rate (WER) was 18.3% (p < 0.001) on the development set, obtained by combining the hypotheses of 3- and 5-layer sequence-discriminatively trained G-LSTM models rescored with a 4-gram language model.
{"title":"Development of the MIT ASR system for the 2016 Arabic Multi-genre Broadcast Challenge","authors":"T. A. Hanai, Wei-Ning Hsu, James R. Glass","doi":"10.1109/SLT.2016.7846280","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846280","url":null,"abstract":"The Arabic language, with over 300 million speakers, has significant diversity and breadth. This proves challenging when building an automated system to understand what is said. This paper describes an Arabic Automatic Speech Recognition system developed on a 1,200 hour speech corpus that was made available for the 2016 Arabic Multi-genre Broadcast (MGB) Challenge. A range of Deep Neural Network (DNN) topologies were modeled including; Feed-forward, Convolutional, Time-Delay, Recurrent Long Short-Term Memory (LSTM), Highway LSTM (H-LSTM), and Grid LSTM (GLSTM). The best performance came from a sequence discriminatively trained G-LSTM neural network. The best overall Word Error Rate (WER) was 18.3% (p < 0:001) on the development set, after combining hypotheses of 3 and 5 layer sequence discriminatively trained G-LSTM models that had been rescored with a 4-gram language model.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116027195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Syntax or semantics? Knowledge-guided joint semantic frame parsing
2016 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846288
Yun-Nung (Vivian) Chen, Dilek Hakanni-Tur, Gökhan Tür, Asli Celikyilmaz, Jianfeng Gao, L. Deng
Spoken language understanding (SLU), also called semantic frame parsing, is a core component of a spoken dialogue system that involves intent prediction and slot filling. Recently, recurrent neural networks (RNNs) have obtained strong results on SLU due to their superior ability to preserve sequential information over time. Traditionally, the SLU component parses semantic frames for utterances as flat structures, since the underlying RNN structure is a linear chain. However, natural language exhibits linguistic properties that provide rich, structured information for better understanding. This paper proposes to apply knowledge-guided structural attention networks (K-SAN), which additionally incorporate non-flat network topologies guided by prior knowledge, to a language understanding task. With an attention mechanism, the model can effectively identify the salient substructures that are essential for parsing a given utterance into its semantic frame, where two types of knowledge, syntax and semantics, are utilized. Experiments on the benchmark Air Travel Information System (ATIS) data and the conversational assistant Cortana data show that 1) the proposed K-SAN models with syntax or semantics outperform the state-of-the-art neural-network-based results, and 2) the improvement for joint semantic frame parsing is more significant, because the structured information provides rich cues for sentence-level understanding, where intent prediction and slot filling can be mutually improved.
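A bare-bones numpy sketch of the structural-attention idea follows: encode a few knowledge-derived substructures, attend over them with the utterance encoding, and use the enriched representation for joint intent and slot prediction. Every dimension, embedding, and projection below is a random placeholder, not the trained K-SAN model.

```python
# Structural-attention sketch for joint intent and slot prediction.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                                  # hidden size (illustrative)

def encode(token_vectors: np.ndarray) -> np.ndarray:
    # Stand-in encoder: mean of token embeddings.
    return token_vectors.mean(axis=0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Utterance tokens and knowledge-guided substructures (e.g. dependency subtrees)
# represented here by random embeddings.
utterance = rng.normal(size=(6, d))                     # 6 tokens
substructures = [rng.normal(size=(k, d)) for k in (2, 3, 2)]

u = encode(utterance)
s = np.stack([encode(sub) for sub in substructures])    # (num_substructures, d)

# Attention of the utterance encoding over the substructures.
attn = softmax(s @ u)                                   # (num_substructures,)
knowledge_summary = attn @ s                            # (d,)
h = u + knowledge_summary                               # knowledge-enriched encoding

# Joint prediction heads (random projections as placeholders).
W_intent = rng.normal(size=(d, 4))                      # 4 intents
W_slot = rng.normal(size=(d, 8))                        # 8 slot tags
intent_scores = softmax(h @ W_intent)
slot_scores = np.apply_along_axis(softmax, 1, (utterance + h) @ W_slot)
print("intent distribution:", np.round(intent_scores, 3))
print("slot tag argmax per token:", slot_scores.argmax(axis=1))
```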
{"title":"Syntax or semantics? knowledge-guided joint semantic frame parsing","authors":"Yun-Nung (Vivian) Chen, Dilek Hakanni-Tur, Gökhan Tür, Asli Celikyilmaz, Jianfeng Gao, L. Deng","doi":"10.1109/SLT.2016.7846288","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846288","url":null,"abstract":"Spoken language understanding (SLU) is a core component of a spoken dialogue system, which involves intent prediction and slot filling and also called semantic frame parsing. Recently recurrent neural networks (RNN) obtained strong results on SLU due to their superior ability of preserving sequential information over time. Traditionally, the SLU component parses semantic frames for utterances considering their flat structures, as the underlying RNN structure is a linear chain. However, natural language exhibits linguistic properties that provide rich, structured information for better understanding. This paper proposes to apply knowledge-guided structural attention networks (K-SAN), which additionally incorporate non-flat network topologies guided by prior knowledge, to a language understanding task. The model can effectively figure out the salient substructures that are essential to parse the given utterance into its semantic frame with an attention mechanism, where two types of knowledge, syntax and semantics, are utilized. The experiments on the benchmark Air Travel Information System (ATIS) data and the conversational assistant Cortana data show that 1) the proposed K-SAN models with syntax or semantics outperform the state-of-the-art neural network based results, and 2) the improvement for joint semantic frame parsing is more significant, because the structured information provides rich cues for sentence-level understanding, where intent prediction and slot filling can be mutually improved.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"263 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124243016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}