Reinforcement learning for spoken dialogue systems using off-policy natural gradient method
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424161
Filip Jurcícek
Reinforcement learning methods have been successfully used to optimise dialogue strategies in statistical dialogue systems. Typically, reinforcement learning techniques learn on-policy, i.e., the dialogue strategy is updated online while the system is interacting with a user. An alternative to this approach is off-policy reinforcement learning, which estimates an optimal dialogue strategy offline from a fixed corpus of previously collected dialogues. This paper proposes a novel off-policy reinforcement learning method based on natural policy gradients and importance sampling. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments indicate that the proposed method learns a dialogue strategy that significantly outperforms the baseline handcrafted dialogue policy.
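Below is a minimal Python/NumPy sketch of the importance-sampling idea behind off-policy policy-gradient learning from a fixed dialogue corpus: logged actions are re-weighted by the ratio of the target policy's probability to the behaviour policy's probability. It uses a plain (vanilla) gradient rather than the natural gradient of the paper, and the feature dimensions, toy corpus, and learning rate are illustrative assumptions.

```python
# Hedged sketch: importance-weighted ("off-policy") policy-gradient estimate for a
# softmax dialogue policy, learned from a fixed corpus of logged dialogues.
# The paper additionally uses a *natural* gradient (Fisher-matrix preconditioning),
# which is omitted here; all names and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
theta = np.zeros((n_features, n_actions))      # policy parameters

def softmax_policy(theta, s):
    """Action probabilities of the target policy given belief-state features s."""
    logits = s @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Toy logged corpus: (state features, action taken, reward, behaviour-policy probability).
corpus = [(rng.normal(size=n_features), rng.integers(n_actions), rng.normal(), 0.25)
          for _ in range(200)]

def off_policy_gradient(theta, corpus):
    grad = np.zeros_like(theta)
    for s, a, r, behaviour_prob in corpus:
        pi = softmax_policy(theta, s)
        w = pi[a] / behaviour_prob             # importance weight pi(a|s) / q(a|s)
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        # grad log pi(a|s) for a softmax policy is s outer (one_hot(a) - pi)
        grad += w * r * np.outer(s, one_hot - pi)
    return grad / len(corpus)

for _ in range(50):                            # plain (not natural) gradient ascent
    theta += 0.1 * off_policy_gradient(theta, corpus)
```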
{"title":"Reinforcement learning for spoken dialogue systems using off-policy natural gradient method","authors":"Filip Jurcícek","doi":"10.1109/SLT.2012.6424161","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424161","url":null,"abstract":"Reinforcement learning methods have been successfully used to optimise dialogue strategies in statistical dialogue systems. Typically, reinforcement techniques learn on-policy i.e., the dialogue strategy is updated online while the system is interacting with a user. An alternative to this approach is off-policy reinforcement learning, which estimates an optimal dialogue strategy offline from a fixed corpus of previously collected dialogues. This paper proposes a novel off-policy reinforcement learning method based on natural policy gradients and importance sampling. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments indicate that the proposed method learns a dialogue strategy, which significantly outperforms the baseline handcrafted dialogue policy.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127388522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Employing boosting to compare cues to verbal feedback in multi-lingual dialog
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424199
Gina-Anne Levow, Siwei Wang
Verbal feedback provides important cues in establishing interactional rapport. The challenge of recognizing contexts for verbal feedback arises largely from its relative sparseness and optionality. In addition, cross-language and inter-speaker variations can make recognition more difficult. In this paper, we show that boosting can improve accuracy in recognizing contexts for verbal feedback based on prosodic cues. In our experiments, we use dyads from three languages (English, Spanish and Arabic) to evaluate two boosting methods, generalized AdaBoost and Gradient Boosting Trees, against Support Vector Machines (SVMs) and a naive baseline, with explicit oversampling of the minority verbal feedback instances. We find that both boosting methods outperform the baseline and SVM classifiers. Analysis of the feature weighting by the boosted classifiers highlights differences and similarities in the prosodic cues employed by members of these diverse language/cultural groups.
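A minimal sketch of the comparison described above, using scikit-learn as a stand-in for the paper's actual setup: AdaBoost and gradient-boosted trees versus an RBF SVM, with explicit oversampling of the minority feedback class in the training set. The feature matrix is a random placeholder for the prosodic cues; class ratio, feature dimensionality, and classifier settings are assumptions.

```python
# Hedged sketch: boosting methods vs. an SVM with minority-class oversampling.
# Features are placeholders for prosodic cues (pitch, energy, duration).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 12))                      # placeholder prosodic features
y = (rng.random(1000) < 0.15).astype(int)            # sparse "feedback context" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Oversample the minority class in the training set only.
minority = X_tr[y_tr == 1]
n_extra = (y_tr == 0).sum() - len(minority)
X_os = np.vstack([X_tr, resample(minority, n_samples=n_extra, replace=True, random_state=1)])
y_os = np.concatenate([y_tr, np.ones(n_extra, dtype=int)])

for name, clf in [("AdaBoost", AdaBoostClassifier(n_estimators=200)),
                  ("GBT", GradientBoostingClassifier(n_estimators=200)),
                  ("SVM", SVC(kernel="rbf"))]:
    clf.fit(X_os, y_os)
    print(name, "accuracy:", clf.score(X_te, y_te))
```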
{"title":"Employing boosting to compare cues to verbal feedback in multi-lingual dialog","authors":"Gina-Anne Levow, Siwei Wang","doi":"10.1109/SLT.2012.6424199","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424199","url":null,"abstract":"Verbal feedback provides important cues in establishing interactional rapport. The challenge of recognizing contexts for verbal feedback largely arises from relative sparseness and optionality. In addition, cross-language and inter-speaker variations can make recognition more difficult. In this paper, we show that boosting can improve accuracy in recognizing contexts for verbal feedback based on prosodic cues. In our experiments, we use dyads from three languages (English, Spanish and Arabic) to evaluate two boosting methods, generalized Adaboost and Gradient Boosting Trees, against Support Vector Machines (SVMs) and a naive baseline, with explicit oversampling on the minority verbal feedback instances. We find that both boosting methods outperform the baseline and SVM classifiers. Analysis of the feature weighting by the boosted classifiers highlights differences and similarities in the prosodic cues employed by members of these diverse language/cultural groups.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130559027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Exploiting loudness dynamics in stochastic models of turn-taking
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424201
K. Laskowski
Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that the additional sensitivity to loudness considerably decreases average cross entropy rates on unseen data, by 0.03 bits per framing interval of 100 ms. This reduction is shown to make loudness-sensitive conversants capable of better predictions, with attention memory requirements at least 5 times smaller and responsiveness latency at least 10 times shorter than the loudness-insensitive baseline.
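A small sketch of the kind of feed-forward predictor described above: the recent binary speech/non-speech contour and a continuous loudness contour are concatenated into a fixed context window, and a small network predicts speech activity in the next 100 ms frame, scored by average cross entropy in bits per frame (the metric reported in the abstract). The synthetic data, context length, and network size are placeholders, not the paper's configuration.

```python
# Hedged sketch: feed-forward turn-taking prediction from binary speech/non-speech
# and continuous loudness contours; data and model sizes are synthetic placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
T, context = 5000, 10                                  # frames (100 ms each), context size

speech = np.zeros(T)                                   # binary speech/non-speech track
for t in range(1, T):                                  # simple persistence model
    speech[t] = speech[t - 1] if rng.random() < 0.9 else 1 - speech[t - 1]
loudness = speech * rng.normal(60, 5, T)               # crude loudness proxy (dB-like)

X, y = [], []
for t in range(context, T):
    X.append(np.concatenate([speech[t - context:t], loudness[t - context:t]]))
    y.append(int(speech[t]))                           # predict the next frame
X, y = np.array(X), np.array(y)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=2)
model.fit(X[:4000], y[:4000])

# Average cross entropy in bits per 100 ms frame on held-out frames.
p = np.clip(model.predict_proba(X[4000:])[:, 1], 1e-6, 1 - 1e-6)
ce = -np.mean(y[4000:] * np.log2(p) + (1 - y[4000:]) * np.log2(1 - p))
print("cross entropy (bits/frame):", round(ce, 3))
```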
{"title":"Exploiting loudness dynamics in stochastic models of turn-taking","authors":"K. Laskowski","doi":"10.1109/SLT.2012.6424201","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424201","url":null,"abstract":"Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that the additional sensitivity to loudness considerably decreases average cross entropy rates on unseen data, by 0.03 bits per framing interval of 100 ms. This reduction is shown to make loudness-sensitive conversants capable of better predictions, with attention memory requirements at least 5 times smaller and responsiveness latency at least 10 times shorter than the loudness-insensitive baseline.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131381917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards a new speech event detection approach for landmark-based speech recognition
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424247
Stefan Ziegler, Bogdan Ludusan, G. Gravier
In this work, we present a new approach to the classification and detection of speech units for use in landmark- or event-based speech recognition systems. We use segmentation to model any time-variable speech unit by a fixed-dimensional observation vector, in order to train a committee of boosted decision stumps on labeled training data. Given an unknown speech signal, the presence of a desired speech unit is estimated by searching, for each time frame, for the segment that provides the maximum classification score. This approach improves the accuracy of a phoneme classification task by 1.7% compared to classification using HMMs. Applying this approach to the detection of broad phonetic landmarks inside a landmark-driven HMM-based speech recognizer significantly improves speech recognition.
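A hedged sketch of the detection scheme: a committee of boosted decision stumps is trained on fixed-dimensional segment vectors, and at test time each frame takes the maximum classifier score over the candidate segments covering it. The features, segment proposals, and decision threshold below are invented placeholders.

```python
# Hedged sketch: boosted decision stumps over segment vectors, with per-frame
# detection by taking the maximum score of any covering segment.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(3)
X_train = rng.normal(size=(500, 20))                  # fixed-length segment vectors
y_train = (X_train[:, 0] > 0).astype(int)             # toy "contains the unit" labels

# AdaBoost's default base learner is a depth-1 decision tree, i.e. a decision stump.
stump_committee = AdaBoostClassifier(n_estimators=100, random_state=3)
stump_committee.fit(X_train, y_train)

# Candidate segments for one utterance: (start_frame, end_frame, feature_vector).
segments = [(s, s + rng.integers(3, 10), rng.normal(size=20)) for s in range(0, 90, 5)]

n_frames = 100
frame_scores = np.full(n_frames, -np.inf)
for start, end, feats in segments:
    score = stump_committee.decision_function(feats.reshape(1, -1))[0]
    for t in range(start, min(end, n_frames)):
        frame_scores[t] = max(frame_scores[t], score)  # max over covering segments

detected = frame_scores > 0.0                          # frames where the unit is detected
print(int(detected.sum()), "frames flagged")
```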
{"title":"Towards a new speech event detection approach for landmark-based speech recognition","authors":"Stefan Ziegler, Bogdan Ludusan, G. Gravier","doi":"10.1109/SLT.2012.6424247","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424247","url":null,"abstract":"In this work, we present a new approach for the classification and detection of speech units for the use in landmark or event-based speech recognition systems. We use segmentation to model any time-variable speech unit by a fixed-dimensional observation vector, in order to train a committee of boosted decision stumps on labeled training data. Given an unknown speech signal, the presence of a desired speech unit is estimated by searching for each time frame the corresponding segment, that provides the maximum classification score. This approach improves the accuracy of a phoneme classification task by 1.7%, compared to classification using HMMs. Applying this approach to the detection of broad phonetic landmarks inside a landmark-driven HMM-based speech recognizer significantly improves speech recognition.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123344485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analysis of speech transcripts to predict winners of U.S. Presidential and Vice-Presidential debates
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424266
Ian Kaplan, Andrew Rosenberg
In this paper, we describe investigations into the speech used in American Presidential and Vice-Presidential debates. We explore possible transcript-based features that may correlate with personally appealing or politically persuasive language. We identify, with chi-squared analysis, features that correlate with success in the debates. We find that with a set of surface-level features from historical debates, we can predict the winners of presidential debates with success moderately above chance.
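A brief sketch of the chi-squared feature analysis mentioned above, using scikit-learn's chi2 on bag-of-words counts from transcripts. The tiny "transcripts" and outcome labels are invented placeholders, not data from the paper.

```python
# Hedged sketch: rank surface-level transcript features by chi-squared association
# with a win/lose label. Transcripts and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

transcripts = ["let me be clear about jobs and taxes",
               "my opponent is wrong about the economy",
               "we will fight for working families",
               "the numbers simply do not add up"]
won_debate = [1, 0, 1, 0]                       # placeholder outcome labels

vec = CountVectorizer()
X = vec.fit_transform(transcripts)
scores, p_values = chi2(X, won_debate)

ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:5])                               # terms most associated with winning
```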
{"title":"Analysis of speech transcripts to predict winners of U.S. Presidential and Vice-Presidential debates","authors":"Ian Kaplan, Andrew Rosenberg","doi":"10.1109/SLT.2012.6424266","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424266","url":null,"abstract":"In this paper, we describe investigations into the speech used in American Presidential and Vice-Presidential debates. We explore possible transcript-based features that may correlate with personally appealing or politically persuasive language. We identify, with chi-squared analysis, features that correlate with success in the debates. We find that with a set of surface-level features from historical debates, we can predict the winners of presidential debates with success moderately above chance.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122571775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The Bavieca open-source speech recognition toolkit
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424249
Daniel Bolaños
This article describes the design of Bavieca, an open-source speech recognition toolkit intended for speech research and system development. The toolkit supports lattice-based discriminative training, wide phonetic contexts, efficient acoustic scoring, large n-gram language models, and the most common feature and model transformations. Bavieca is written entirely in C++ and presents a simple and modular design with an emphasis on scalability and reusability. Bavieca achieves competitive results on standard benchmarks. The toolkit is distributed under the permissive Apache 2.0 license and is freely available on SourceForge.
{"title":"The Bavieca open-source speech recognition toolkit","authors":"Daniel Bolaños","doi":"10.1109/SLT.2012.6424249","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424249","url":null,"abstract":"This article describes the design of Bavieca, an open-source speech recognition toolkit intended for speech research and system development. The toolkit supports lattice-based discriminative training, wide phonetic-context, efficient acoustic scoring, large n-gram language models, and the most common feature and model transformations. Bavieca is written entirely in C++ and presents a simple and modular design with an emphasis on scalability and reusability. Bavieca achieves competitive results in standard benchmarks. The toolkit is distributed under the highly unrestricted Apache 2.0 license, and is freely available on SourceForge.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131501436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Discriminative spoken language understanding using word confusion networks
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424218
Matthew Henderson, Milica Gasic, Blaise Thomson, P. Tsiakoulis, Kai Yu, S. Young
Current commercial dialogue systems typically use hand-crafted grammars for Spoken Language Understanding (SLU) operating on the top one or two hypotheses output by the speech recogniser. These systems are expensive to develop and they suffer from significant degradation in performance when faced with recognition errors. This paper presents a robust method for SLU based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. Following [1], the system uses SVM classifiers operating on n-gram features, trained on unaligned input/output pairs. Performance is evaluated on both an off-line corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance both in terms of accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the previous system output.
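A hedged sketch of one way to realise the feature extraction described above: expected unigram and bigram counts are accumulated from a word confusion network's arc posteriors and passed to a linear SVM for a single semantic act. The toy confusion networks, labels, and the exact feature scheme are assumptions rather than the paper's implementation.

```python
# Hedged sketch: posterior-weighted n-gram features from a word confusion network,
# fed to an SVM classifier for one semantic act. Data are invented placeholders.
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def cnet_ngram_features(cnet):
    """cnet: list of bins, each a list of (word, posterior) alternatives."""
    feats = defaultdict(float)
    for i, bin_ in enumerate(cnet):
        for w, p in bin_:
            feats["1g:" + w] += p                          # expected unigram count
            if i + 1 < len(cnet):
                for w2, p2 in cnet[i + 1]:
                    feats["2g:" + w + "_" + w2] += p * p2  # expected bigram count
    return dict(feats)

# Two toy confusion networks and whether they express the act "request(phone)".
cnets = [[[("phone", 0.7), ("fone", 0.3)], [("number", 0.9), ("lumber", 0.1)]],
         [[("whats", 0.8), ("what", 0.2)], [("the", 1.0)], [("address", 0.95), ("dress", 0.05)]]]
labels = [1, 0]

vec = DictVectorizer()
X = vec.fit_transform([cnet_ngram_features(c) for c in cnets])
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform([cnet_ngram_features(cnets[0])])))  # re-classify the first network
```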
{"title":"Discriminative spoken language understanding using word confusion networks","authors":"Matthew Henderson, Milica Gasic, Blaise Thomson, P. Tsiakoulis, Kai Yu, S. Young","doi":"10.1109/SLT.2012.6424218","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424218","url":null,"abstract":"Current commercial dialogue systems typically use hand-crafted grammars for Spoken Language Understanding (SLU) operating on the top one or two hypotheses output by the speech recogniser. These systems are expensive to develop and they suffer from significant degradation in performance when faced with recognition errors. This paper presents a robust method for SLU based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. Following [1], the system uses SVM classifiers operating on n-gram features, trained on unaligned input/output pairs. Performance is evaluated on both an off-line corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance both in terms of accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the previous system output.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130603733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Train&align: A new online tool for automatic phonetic alignment
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424260
Sandrine Brognaux, Sophie Roekhaut, Thomas Drugman, Richard Beaufort
Several automatic phonetic alignment tools have been proposed in the literature. They usually rely on pre-trained speaker-independent models to align new corpora. Their drawback is that they cover a very limited number of languages and might not perform properly on different speaking styles. This paper presents a new tool for automatic phonetic alignment, available online. Its distinguishing feature is that it trains the model directly on the corpus to be aligned, which makes it applicable to any language and speaking style. Experiments on three corpora show that it provides results comparable to other existing tools. It also allows some training parameters to be tuned: the use of tied-state triphones, for example, yields a further improvement of about 1.5% at a 20 ms threshold. A manually-aligned part of the corpus can also be used for bootstrapping to improve model quality. Alignment rates were found to increase significantly, by up to 20%, using only 30 seconds of bootstrapping data.
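For concreteness, here is a small sketch of the evaluation metric referred to above, assuming the alignment rate is the fraction of automatic phone boundaries falling within a given threshold (e.g. 20 ms) of the manual reference boundaries; the boundary times are invented placeholders in seconds.

```python
# Hedged sketch: alignment rate as the fraction of automatic boundaries within
# a tolerance of the manual reference. Times below are invented placeholders.
def alignment_rate(auto_boundaries, ref_boundaries, threshold=0.020):
    assert len(auto_boundaries) == len(ref_boundaries)
    hits = sum(abs(a - r) <= threshold for a, r in zip(auto_boundaries, ref_boundaries))
    return hits / len(ref_boundaries)

ref = [0.00, 0.12, 0.25, 0.41, 0.58]
auto = [0.01, 0.13, 0.22, 0.42, 0.57]
print(alignment_rate(auto, ref))      # 0.8 -> four of five boundaries within 20 ms
```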
{"title":"Train&align: A new online tool for automatic phonetic alignment","authors":"Sandrine Brognaux, Sophie Roekhaut, Thomas Drugman, Richard Beaufort","doi":"10.1109/SLT.2012.6424260","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424260","url":null,"abstract":"Several automatic phonetic alignment tools have been proposed in the literature. They usually rely on pre-trained speaker-independent models to align new corpora. Their drawback is that they cover a very limited number of languages and might not perform properly for different speaking styles. This paper presents a new tool for automatic phonetic alignment available online. Its specificity is that it trains the model directly on the corpus to align, which makes it applicable to any language and speaking style. Experiments on three corpora show that it provides results comparable to other existing tools. It also allows the tuning of some training parameters. The use of tied-state triphones, for example, shows further improvement of about 1.5% for a 20 ms threshold. A manually-aligned part of the corpus can also be used as bootstrap to improve the model quality. Alignment rates were found to significantly increase, up to 20%, using only 30 seconds of bootstrapping data.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"142 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114217865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Intent transfer in speech-to-speech machine translation
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424214
G. Anumanchipalli, Luís C. Oliveira, A. Black
This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.
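One possible form such a transfer function could take is a linear map from per-word prosodic statistics of the source utterance to those of the aligned target-language words, fitted by least squares on the parallel corpus. The sketch below illustrates that assumption only; the abstract does not commit to this particular form, and all features, dimensions, and data are placeholders.

```python
# Hedged sketch: a least-squares linear "transfer function" from source-word
# prosodic features (e.g. mean F0, energy, duration) to aligned target-word
# features, fitted on a parallel corpus. Everything here is an assumed placeholder.
import numpy as np

rng = np.random.default_rng(4)
n_pairs, n_feats = 300, 3                       # aligned word pairs, prosodic features
source = rng.normal(size=(n_pairs, n_feats))    # placeholder source-word prosody
true_map = np.array([[0.8, 0.1, 0.0],
                     [0.0, 0.9, 0.1],
                     [0.1, 0.0, 0.7]])
target = source @ true_map + rng.normal(scale=0.1, size=(n_pairs, n_feats))

# Fit W minimising ||[source, 1] @ W - target||^2 (bias column appended).
S = np.hstack([source, np.ones((n_pairs, 1))])
W, *_ = np.linalg.lstsq(S, target, rcond=None)

def transfer_prosody(source_word_feats):
    """Predict target-language word prosody to impose during synthesis."""
    return np.append(source_word_feats, 1.0) @ W

print(transfer_prosody(source[0]))
```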
{"title":"Intent transfer in speech-to-speech machine translation","authors":"G. Anumanchipalli, Luís C. Oliveira, A. Black","doi":"10.1109/SLT.2012.6424214","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424214","url":null,"abstract":"This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124691012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Improved semantic retrieval of spoken content by language models enhanced with acoustic similarity graph
Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424219
Hung-yi Lee, Tsung-Hsien Wen, Lin-Shan Lee
Retrieving objects semantically related to the query has been widely studied in text information retrieval. However, when applying these text-based techniques to spoken content, the inevitable recognition errors may seriously degrade performance. In this paper, we propose to enhance the expected term frequencies estimated from spoken content using acoustic similarity graphs. For each word in the lexicon, a graph is constructed describing the acoustic similarity among spoken segments in the archive. Score propagation over the graph helps in estimating the expected term frequencies. The enhanced expected term frequencies can be used in the language modeling retrieval approach, as well as in semantic retrieval techniques such as document expansion based on latent semantic analysis and query expansion considering both words and latent topic information. Preliminary experiments on Mandarin broadcast news indicated that improved performance was achievable under different conditions.
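A hedged sketch of one plausible form of the score propagation: for a single lexicon word, segment-level expected term frequencies from the ASR output are interpolated with the scores of acoustically similar segments over a row-normalised similarity graph. The similarity matrix, initial scores, and interpolation weight are assumed placeholders, not the paper's exact formulation.

```python
# Hedged sketch: smooth ASR-derived expected term frequencies for one lexicon word
# by iterative propagation over an acoustic similarity graph of spoken segments.
import numpy as np

rng = np.random.default_rng(5)
n_segments = 6
sim = rng.random((n_segments, n_segments))          # placeholder acoustic similarities
sim = (sim + sim.T) / 2                             # make the graph symmetric
np.fill_diagonal(sim, 0.0)
P = sim / sim.sum(axis=1, keepdims=True)            # row-normalised transition matrix

tf0 = rng.random(n_segments)                        # expected term frequencies from ASR
alpha, tf = 0.5, tf0.copy()                         # assumed interpolation weight
for _ in range(30):                                 # propagate until roughly converged
    tf = (1 - alpha) * tf0 + alpha * P @ tf

print("enhanced expected term frequencies:", np.round(tf, 3))
```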
{"title":"Improved semantic retrieval of spoken content by language models enhanced with acoustic similarity graph","authors":"Hung-yi Lee, Tsung-Hsien Wen, Lin-Shan Lee","doi":"10.1109/SLT.2012.6424219","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424219","url":null,"abstract":"Retrieving objects semantically related to the query has been widely studied in text information retrieval. However, when applying the text-based techniques on spoken content, the inevitable recognition errors may seriously degrade the performance. In this paper, we propose to enhance the expected term frequencies estimated from spoken content by acoustic similarity graphs. For each word in the lexicon, a graph is constructed describing acoustic similarity among spoken segments in the archive. Score propagation over the graph helps in estimating the expected term frequencies. The enhanced expected term frequencies can be used in the language modeling retrieval approach, as well as semantic retrieval techniques such as the document expansion based on latent semantic analysis, and query expansion considering both words and latent topic information. Preliminary experiments performed on Mandarin broadcast news indicated that improved performance were achievable under different conditions.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122277832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}