Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846279
QCRI advanced transcription system (QATS) for the Arabic Multi-Dialect Broadcast media recognition: MGB-2 challenge
Sameer Khurana, Ahmed M. Ali
In this paper, we describe Qatar Computing Research Institute's (QCRI) speech transcription system for the 2016 Dialectal Arabic Multi-Genre Broadcast (MGB-2) challenge. MGB-2 is a controlled evaluation using 1,200 hours of audio with lightly supervised transcription. Our system, a combination of three purely sequence-trained recognition systems, achieved the lowest WER of 14.2% among the nine participating teams. Key features of our transcription system are: purely sequence-trained acoustic models using the recently introduced lattice-free Maximum Mutual Information (LF-MMI) modeling framework; language model rescoring using four-gram and Recurrent Neural Network with MaxEnt connections (RNNME) language models; and system combination using the Minimum Bayes Risk (MBR) decoding criterion. The whole system is built using the Kaldi speech recognition toolkit.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846243
Learning utterance-level normalisation using Variational Autoencoders for robust automatic speech recognition
Shawn Tan, K. Sim
This paper presents a Variational Autoencoder (VAE) based framework for modelling utterances. In this model, a mapping from an utterance to a distribution over the latent space, the VAE-utterance feature, is defined, in addition to a frame-level mapping, the VAE-frame feature. Using the Aurora-4 dataset, we train these models and analyse how well they capture speaker and utterance variability, and also use combinations of LDA, i-vector, and VAE-frame and VAE-utterance features for speech recognition training. We find that the system works equally well using VAE-frame + VAE-utterance features alone, and by using an LDA + VAE-frame + VAE-utterance feature combination, we obtain a word error rate (WER) of 9.59%, a gain over the 9.72% baseline which uses an LDA + i-vector combination.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846305
A prioritized grid long short-term memory RNN for speech recognition
Wei-Ning Hsu, Yu Zhang, James R. Glass
Recurrent neural networks (RNNs) are naturally suited to speech recognition because of their ability to utilize dynamically changing temporal information. Deep RNNs have been argued to model temporal relationships at different time granularities, but suffer from vanishing gradient problems. In this paper, we extend stacked long short-term memory (LSTM) RNNs with grid LSTM blocks that formulate computation along not only the temporal dimension but also the depth dimension, in order to alleviate this issue. Moreover, we prioritize the depth dimension over the temporal one so that it receives more up-to-date information, since its output is used for classification. We call this model the prioritized Grid LSTM (pGLSTM). Extensive experiments on four large datasets (AMI, HKUST, GALE, and MGB) indicate that the pGLSTM outperforms alternative deep LSTM models, beating stacked LSTMs by 4% to 7% relative, and achieves new benchmarks among uni-directional models on all datasets.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846332
Low-rank bases for factorized hidden layer adaptation of DNN acoustic models
Lahiru Samarakoon, K. Sim
Recently, the factorized hidden layer (FHL) adaptation method was proposed for speaker adaptation of deep neural network (DNN) acoustic models. In addition to the standard affine transformation, an FHL contains a speaker-dependent (SD) transformation matrix formed as a linear combination of rank-1 matrices and an SD bias formed as a linear combination of vectors. On the other hand, full-rank bases are used in a similar DNN adaptation method based on cluster adaptive training (CAT). It is therefore interesting to investigate the effect of the rank of the bases used for adaptation: increasing the rank improves the speaker subspace representation without increasing the number of learnable speaker parameters. In this work, we investigate the effect of using various ranks for the bases of the SD transformation of FHLs on the Aurora 4, AMI IHM and AMI SDM tasks. Experimental results show that when one FHL layer is used, low-rank bases of rank 50 are optimal, rather than full-rank bases. Furthermore, when multiple FHLs are used, rank-1 bases are sufficient.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846301
Contextual language model adaptation using dynamic classes
Lucy Vasserman, Ben Haynor, Petar S. Aleksic
Recent focus on assistant products has increased the need for extremely flexible speech systems that adapt well to specific users' needs. An important aspect of this is enabling users to make voice commands referencing their own personal data, such as favorite songs, application names, and contacts. Recognition accuracy for common commands such as playing music and sending text messages can be greatly improved if we know a user's preferences. In the past, we have addressed this problem using class-based language models that allow for query-time injection of class instances. However, this approach is limited by the need to train class-based models ahead of time.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846264
Modelling speaker and channel variability using deep neural networks for robust speaker verification
Gautam Bhattacharya, Md. Jahangir Alam, P. Kenny, Vishwa Gupta
We propose to improve the performance of i-vector based speaker verification by processing the i-vectors with a deep neural network before they are fed to a cosine distance or probabilistic linear discriminant analysis (PLDA) classifier. To this end we build on an existing model that we refer to as Non-linear Within Class Normalization (NWCN) and introduce a novel Speaker Classifier Network (SCN). Both models deliver impressive speaker verification performance, showing 56% and 68% relative improvements over standard i-vectors when combined with a cosine distance backend. The NWCN model also reduces the equal error rate for PLDA from 1.78% to 1.63%. We also test these models under the constraints of domain mismatch, i.e., when no in-domain training data is available. Under these conditions, SCN features in combination with cosine distance perform better than the PLDA baseline, achieving an equal error rate of 2.92% as compared to 3.37%.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846275
Semantically driven inversion transduction grammar induction for early stage training of spoken language translation
Meriem Beloucif, Dekai Wu
We propose an approach in which we inject a crosslingual semantic frame based objective function directly into inversion transduction grammar (ITG) induction in order to semantically train spoken language translation systems. This approach is a follow-up to our recent work on improving machine translation quality by tuning log-linear mixture weights with a semantic frame based objective function in the late, final stage of statistical machine translation training. In contrast, our new approach injects a semantic frame based objective function back into earlier stages of the training pipeline, during the actual learning of the translation model, biasing learning toward semantically more accurate alignments. Our work is motivated by the fact that ITG alignments have empirically been shown to fully cover crosslingual semantic frame alternations. We show that injecting a crosslingual semantic objective function to drive ITG induction further sharpens the ITG constraints, leading to better performance than either the conventional ITG or the traditional GIZA++ based approaches.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846245
Iterative training of a DPGMM-HMM acoustic unit recognizer in a zero resource scenario
Michael Heck, S. Sakti, Satoshi Nakamura
In this paper we propose a framework for building a full-fledged acoustic unit recognizer in a zero-resource setting, i.e., without any provided labels. For that, we combine an iterative Dirichlet process Gaussian mixture model (DPGMM) clustering framework with a standard pipeline for supervised GMM-HMM acoustic model (AM) and n-gram language model (LM) training, enhanced by a scheme for iterative model re-training. We use the DPGMM to cluster feature vectors into a dynamically sized set of acoustic units. The frame-based class labels serve as transcriptions of the audio data and are used as input to the AM and LM training pipeline. We show that iterative unsupervised model re-training of this DPGMM-HMM acoustic unit recognizer improves performance according to an evaluation based on an ABX sound class discriminability task. Our results show that the learned models generalize well and that sound class discriminability benefits from the contextual information introduced by the language model. Our systems are competitive with phone recognizers trained with supervision, and can beat the baseline set by DPGMM clustering.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846312
Recurrent convolutional neural networks for structured speech act tagging
Takashi Ushio, Hongjie Shi, M. Endo, K. Yamagami, Noriaki Horii
Spoken language understanding (SLU) is one of the important problems in natural language processing, especially in dialog systems. The Fifth Dialog State Tracking Challenge (DSTC5) introduced an SLU task in which speech utterances from two speaker roles are automatically tagged with speech act tags and semantic slot tags. In this paper, we focus on speech act tagging. We propose a local coactivate multi-task learning model for capturing structured speech acts, based on sentence features from recurrent convolutional neural networks. Experimental results show that our model outperformed all other submitted entries and was able to capture coactivated local features of category and attribute, which are the parts of a speech act.
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846258
Abstractive headline generation for spoken content by attentive recurrent neural networks with ASR error modeling
Lang-Chi Yu, Hung-yi Lee, Lin-Shan Lee
Headline generation for spoken content is important because spoken content is difficult to display on a screen and browse. It is a special type of abstractive summarization, in which the summary is generated word by word from scratch without reusing any part of the original content. Many deep learning approaches for headline generation from text documents have been proposed recently, all requiring huge quantities of training data, which are difficult to obtain for spoken document summarization. In this paper, we propose an ASR error modeling approach to learn the underlying structure of ASR error patterns and incorporate this model into an Attentive Recurrent Neural Network (ARNN) architecture. In this way, the model for abstractive headline generation for spoken content can be learned from abundant text data and the ASR data of some recognizers. Experiments showed very encouraging results and verified that the proposed ASR error model works well even when the input spoken content is recognized by a recognizer very different from the one the model learned from.