Morphological random forests for language modeling of inflectional languages
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777872
I. Oparin, O. Glembek, L. Burget, J. Černocký
In this paper, we are concerned with using decision trees (DT) and random forests (RF) in language modeling for Czech LVCSR. We show that the RF approach can be successfully implemented for language modeling of an inflectional language. The performance of word-based and morphological DTs and RFs was evaluated on a lecture recognition task. We show that while DTs perform worse than conventional trigram language models (LM), RFs of both kinds outperform the latter. WER (up to 3.4% relative) and perplexity (10%) reductions over the trigram model can be gained with morphological RFs. Further improvement is obtained by interpolating the DT and RF LMs with the trigram one (up to 15.6% perplexity and 4.8% WER relative reduction). We also investigate the distribution of morphological feature types chosen for splitting the data at different levels of the DTs.
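The interpolation step lends itself to a short illustration. Below is a minimal Python sketch of linear interpolation between two language models; the function names and the fixed weight are assumptions for illustration, and in practice the weight would be tuned on held-out data (e.g. by EM).

```python
def interpolate_lm(p_rf, p_trigram, lam=0.5):
    """Linear interpolation of two language models.

    p_rf, p_trigram: callables mapping (word, history) -> probability,
    standing in for the RF LM and the trigram LM (hypothetical API).
    lam: interpolation weight, tuned on held-out data in practice.
    """
    def p_mix(word, history):
        return lam * p_rf(word, history) + (1.0 - lam) * p_trigram(word, history)
    return p_mix
```

Perplexity figures such as those reported above would come from evaluating this mixture on held-out text with the weight optimized.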
{"title":"Morphological random forests for language modeling of inflectional languages","authors":"I. Oparin, O. Glembek, L. Burget, J. Černocký","doi":"10.1109/SLT.2008.4777872","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777872","url":null,"abstract":"In this paper, we are concerned with using decision trees (DT) and random forests (RF) in language modeling for Czech LVCSR. We show that the RF approach can be successfully implemented for language modeling of an inflectional language. Performance of word-based and morphological DTs and RFs was evaluated on lecture recognition task. We show that while DTs perform worse than conventional trigram language models (LM), RFs of both kind outperform the latter. WER (up to 3.4% relative) and perplexity (10%) reduction over the trigram model can be gained with morphological RFs. Further improvement is obtained after interpolation of DT and RF LMs with the trigram one (up to 15.6% perplexity and 4.8% WER relative reduction). In this paper we also investigate distribution of morphological feature types chosen for splitting data at different levels of DTs.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121807411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance analysis of spectral and prosodic features and their fusion for emotion recognition in speech
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777903
Manish Gaurav
In this paper, we study the performance of different prosodic and spectral features of speech on an emotion detection task. In particular, a feature selection algorithm is used to assess the relevance of the different features. Gaussian mixture models (GMM) are used to model the features extracted at the frame level, while support vector machines (SVM) and k-nearest neighbor (k-NN) methods are used to model the features extracted at the utterance level. We use a normalization approach (T-norm) to combine the scores from the different models. Results for this approach are reported on the Berlin emotional speech database, where the task is to classify six emotions: anger, happiness, neutral, sadness, boredom, and anxiety. We show that the feature selection algorithm improves the results, and that the fusion of the GMM and SVM scores yields an overall accuracy of 75.4% on this task.
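The T-norm fusion can be sketched briefly. The following is one plausible reading, assuming each classifier produces one score per emotion class: each classifier's scores are normalized before a weighted combination. The fusion weight and the normalization over the class scores are assumptions, not the paper's exact recipe.

```python
import numpy as np

def t_norm(score, cohort_scores):
    """Normalize a raw score by the mean and spread of a cohort of
    competing scores, making scores from different models comparable."""
    mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
    return (score - mu) / (sigma + 1e-12)

def fuse_and_classify(gmm_scores, svm_scores, w=0.5):
    """Combine per-emotion scores from the GMM and SVM systems after
    normalization and return the index of the winning emotion class."""
    g = np.array([t_norm(s, gmm_scores) for s in gmm_scores])
    v = np.array([t_norm(s, svm_scores) for s in svm_scores])
    return int(np.argmax(w * g + (1.0 - w) * v))
```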
{"title":"Performance analysis of spectral and prosodic features and their fusion for emotion recognition in speech","authors":"Manish Gaurav","doi":"10.1109/SLT.2008.4777903","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777903","url":null,"abstract":"In this paper, we study the performance of different prosody and spectral features of speech on an emotion detection task. In particular, a feature selection algorithm has been used to assess the relevancy of the different features. Gaussian mixtures models have been used to model the features extracted at the frame-level, while support vector machines (SVM) and k-nearest neighbor (k-NN) methods have been used to model the features extracted at the utterance level. We use a normalization approach (T-norm) to combine the scores from the different models. The results using the above approach are reported for the Berlin emotional database corpus and the task consisted of classifying the six emotions namely - anger, happiness, neutral, sadness, boredom and anxiety. We show that the use of feature selection algorithm improves the result, while in addition the fusion of GMM and SVM results in an overall accuracy of 75.4% for the above task.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127151341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of dynamic model adaptation beyond speech recognition
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777894
Fernando Batista, R. Amaral, I. Trancoso, N. Mamede
The application of speech recognition to live subtitling of Broadcast News has motivated daily adaptation of the recognizer's lexical and language models with text material retrieved from online newspapers. This paper studies the impact of this adaptation on two of the modules following speech recognition: capitalization and topic indexation. We describe and evaluate different adaptation approaches that try to exploit the dynamics of the language.
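As an illustration of what daily lexical adaptation can look like, here is a simplified sketch that promotes frequent out-of-vocabulary words from the day's newspaper text into the recognizer's vocabulary; the thresholds and the selection heuristic are assumptions, not the authors' exact criteria.

```python
from collections import Counter

def daily_vocab_update(base_vocab, news_tokens, max_new=1000, min_count=3):
    """Pick frequent out-of-vocabulary words from the day's newspaper
    text to add to the recognizer's lexicon (simplified heuristic).

    base_vocab: set of in-vocabulary words.
    news_tokens: tokenized text retrieved from online newspapers.
    """
    oov = Counter(t for t in news_tokens if t not in base_vocab)
    new_words = [w for w, c in oov.most_common(max_new) if c >= min_count]
    return base_vocab | set(new_words)

# Usage: "virus" occurs often enough to be promoted, the rest do not.
vocab = daily_vocab_update({"le", "la", "de"},
                           "le nouveau virus virus virus de la grippe".split())
```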
{"title":"Impact of dynamic model adaptation beyond speech recognition","authors":"Fernando Batista, R. Amaral, I. Trancoso, N. Mamede","doi":"10.1109/SLT.2008.4777894","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777894","url":null,"abstract":"The application of speech recognition to live subtitling of Broadcast News has motivated the adaptation of the lexical and language models of the recognizer on a daily basis with text material retrieved from online newspapers. This paper studies the impact of this adaptation on two of the blocks following the speech recognition module: capitalization and topic indexation. We describe and evaluate different adaptation approaches that try to explore the language dynamics.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"515 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132754505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent improvements in BBN's English/Iraqi speech-to-speech translation system
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777886
F. Choi, S. Tsakalidis, S. Saleem, C. Kao, R. Meermeier, K. Krstovski, C. Moran, Krishna Subramanian, R. Prasad, P. Natarajan
We report on recent improvements in our English/Iraqi Arabic speech-to-speech translation system. User interface improvements include a novel parallel approach to user confirmation that makes confirmation cost-free in terms of dialog duration. Automatic speech recognition improvements include the incorporation of state-of-the-art techniques in feature transformation and discriminative training. Machine translation improvements include a novel combination of multiple alignments derived from various pre-processing techniques, such as Arabic segmentation and English word compounding, higher-order n-grams for the target language model, and the use of context in the form of semantic classes and part-of-speech tags.
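The combination of multiple alignments can be pictured as a simple voting scheme over alignment links; this generic heuristic is an assumption for illustration and not necessarily BBN's exact combination method.

```python
from collections import Counter

def combine_alignments(alignment_sets, min_votes=2):
    """Combine word alignments produced under different pre-processing
    schemes (e.g. with and without Arabic segmentation) by keeping
    links that at least `min_votes` of the alignments agree on."""
    votes = Counter(link for links in alignment_sets for link in links)
    return {link for link, n in votes.items() if n >= min_votes}

# Usage: each alignment is a set of (source_idx, target_idx) links.
a1 = {(0, 0), (1, 2), (2, 1)}
a2 = {(0, 0), (1, 2), (3, 3)}
print(combine_alignments([a1, a2]))  # {(0, 0), (1, 2)}
```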
{"title":"Recent improvements in BBN's English/Iraqi speech-to-speech translation system","authors":"F. Choi, S. Tsakalidis, S. Saleem, C. Kao, R. Meermeier, K. Krstovski, C. Moran, Krishna Subramanian, R. Prasad, P. Natarajan","doi":"10.1109/SLT.2008.4777886","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777886","url":null,"abstract":"We report on recent improvements in our English/Iraqi Arabic speech-to-speech translation system. User interface improvements include a novel parallel approach to user confirmation which makes confirmation cost-free in terms of dialog duration. Automatic speech recognition improvements include the incorporation of state-of-the-art techniques in feature transformation and discriminative training. Machine translation improvements include a novel combination of multiple alignments derived from various pre-processing techniques, such as Arabic segmentation and English word compounding, higher order N-grams for target language model, and use of context in form of semantic classes and part-of-speech tags.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133218748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correcting ASR outputs: Specific solutions to specific errors in French
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777878
Richard Dufour, Y. Estève
Automatic speech recognition (ASR) systems are used in a large number of applications, in spite of their inevitable recognition errors. In this study we propose a pragmatic approach to automatically repair ASR outputs by taking into account linguistic and acoustic information, using formal rules or stochastic methods. The proposed strategy consists in developing a specific correction solution for each specific kind of error. In this paper, we apply this strategy to two case studies specific to the French language. We show that, on automatic transcriptions of French broadcast news, it is possible to decrease the error rate of a specific error type by 11.4% in one of the two case studies and by 86.4% in the other. These results are encouraging and motivate developing more specific solutions to cover a wider set of errors in future work.
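A formal correction rule of the kind described can be as small as a pattern substitution over the transcript. The rule below targets an artificial French homophone confusion and is purely illustrative; the paper's actual rules address other, specific error types.

```python
import re

# (pattern, replacement) pairs, each targeting one specific error type.
# The single rule here (collapsing "quelque fois" to "quelquefois") is
# a hypothetical example, not one of the paper's rules.
RULES = [
    (re.compile(r"\bquelque fois\b"), "quelquefois"),
]

def correct(transcript: str) -> str:
    """Apply each specific correction rule to the ASR output."""
    for pattern, repl in RULES:
        transcript = pattern.sub(repl, transcript)
    return transcript

print(correct("il vient quelque fois le soir"))
```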
{"title":"Correcting asr outputs: Specific solutions to specific errors in French","authors":"Richard Dufour, Y. Estève","doi":"10.1109/SLT.2008.4777878","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777878","url":null,"abstract":"Automatic speech recognition (ASR) systems are used in a large number of applications, in spite of the inevitable recognition errors. In this study we propose a pragmatic approach to automatically repair ASR outputs by taking into account linguistic and acoustic information, using formal rules or stochastic methods. The proposed strategy consists in developing a specific correction solution for each specific kind of errors. In this paper, we apply this strategy on two case studies specific to French language. We show that it is possible, on automatic transcriptions of French broadcast news, to decrease the error rate of a specific error by 11.4% in one of two the case studies, and 86.4% in the other one. These results are encouraging and show the interest of developing more specific solutions to cover a wider set of errors in a future work.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128957470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic keyword extraction for the meeting corpus using supervised approach and bigram expansion
Pub Date: 2008-12-01 | DOI: 10.1109/slt.2008.4777870
Fei Liu, Feifan Liu, Yang Liu
{"title":"Automatic keyword extraction for the meeting corpus using supervised approach and bigram expansion","authors":"Fei Liu, Feifan Liu, Yang Liu","doi":"10.1109/slt.2008.4777870","DOIUrl":"https://doi.org/10.1109/slt.2008.4777870","url":null,"abstract":"","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"28 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116859026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous topic language modeling for speech recognition
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777873
C. Chueh, Jen-Tzung Chien
A continuous representation of word sequences can effectively alleviate the data sparseness problem of n-gram language models, in which words are treated as discrete variables and unseen events are common. This problem becomes increasingly severe when extracting long-distance regularities for high-order n-gram models. Rather than working in the discrete word space, we construct a continuous space of word sequences in which latent topic information is extracted. The continuous vector is formed by the topic posterior probabilities, and a least-squares projection matrix from the discrete word space to the continuous topic space is estimated accordingly. Unseen words can then be predicted through the new continuous latent topic language model. In experiments on continuous speech recognition, we obtain significant performance improvements over the conventional topic-based language model.
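The least-squares projection described above can be sketched in a few lines, assuming bag-of-words history vectors H and topic posterior vectors T obtained from topic-model training; both matrices here are random stand-ins, and the dimensions are arbitrary.

```python
import numpy as np

# H: (n_histories, vocab_size) discrete word-space matrix
# T: (n_histories, n_topics) topic posterior matrix (rows sum to 1)
rng = np.random.default_rng(0)
H = rng.random((100, 50))
T = rng.random((100, 8))
T /= T.sum(axis=1, keepdims=True)

# Projection matrix minimizing ||H A - T||^2, shape (vocab_size, n_topics).
A, *_ = np.linalg.lstsq(H, T, rcond=None)

# Map an unseen history into the continuous topic space and renormalize
# so the result is again a proper posterior distribution.
h_new = rng.random(50)
t_new = np.clip(h_new @ A, 1e-12, None)
t_new /= t_new.sum()
```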
{"title":"Continuous topic language modeling for speech recognition","authors":"C. Chueh, Jen-Tzung Chien","doi":"10.1109/SLT.2008.4777873","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777873","url":null,"abstract":"Continuous representation of word sequence can effectively solve data sparseness problem in n-gram language model, where the discrete variables of words are represented and the unseen events are prone to happen. This problem is increasingly severe when extracting long-distance regularities for high-order n-gram model. Rather than considering discrete word space, we construct the continuous space of word sequence where the latent topic information is extracted. The continuous vector is formed by the topic posterior probabilities and the least-squares projection matrix from discrete word space to continuous topic space is estimated accordingly. The unseen words can be predicted through the new continuous latent topic language model. In the experiments on continuous speech recognition, we obtain significant performance improvement over the conventional topic-based language model.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116273104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“Who is this” quiz dialogue system and users' evaluation
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777862
M. Sawaki, Yasuhiro Minami, Ryuichiro Higashinaka, Kohji Dohsaka, Eisaku Maeda
In order to design a dialogue system that users enjoy and want to keep around for a long time, it is important to know the effect of the system's actions on users. This paper describes the “Who is this” quiz dialogue system and its evaluation by users. Its quiz-style information presentation has been found effective for educational tasks. In our ongoing effort to make the system closer to a conversational partner, we implemented it as a stuffed toy (or its CG equivalent). Quizzes are automatically generated from Wikipedia articles rather than from hand-crafted sets of biographical facts, and network mining is used to prepare adaptive system responses. Experiments showed the effectiveness of the person network and revealed the relationship between user attributes and interest level.
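A minimal sketch of the quiz-generation idea, assuming the Wikipedia article text has already been fetched and split into sentences; the masking shown here is only the simplest part of what the real system does (hint ranking and person-network mining are omitted).

```python
def make_quiz(person, article_sentences, n_hints=3):
    """Turn the opening sentences of a pre-fetched Wikipedia article
    into "Who is this?" hints by masking the person's name."""
    return [sent.replace(person, "this person")
            for sent in article_sentences[:n_hints]]

print(make_quiz("Ada Lovelace",
                ["Ada Lovelace was an English mathematician.",
                 "Ada Lovelace worked on Babbage's Analytical Engine."]))
```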
{"title":"“Who is this” quiz dialogue system and users' evaluation","authors":"M. Sawaki, Yasuhiro Minami, Ryuichiro Higashinaka, Kohji Dohsaka, Eisaku Maeda","doi":"10.1109/SLT.2008.4777862","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777862","url":null,"abstract":"In order to design a dialogue system that users enjoy and want to be near for a long time, it is important to know the effect of the system's action on users. This paper describes ldquoWho is thisrdquo quiz dialogue system and its users' evaluation. Its quiz-style information presentation has been found effective for educational tasks. In our ongoing effort to make it closer to a conversational partner, we implemented the system as a stuffed-toy (or CG equivalent). Quizzes are automatically generated from Wikipedia articles, rather than from hand-crafted sets of biographical facts. Network mining is utilized to prepare adaptive system responses. Experiments showed the effectiveness of person network and the relationship of user attribute and interest level.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"242 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114451000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Caller Experience: A method for evaluating dialog systems and its automatic prediction
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777857
Keelan Evanini, P. Hunter, J. Liscombe, David Suendermann-Oeft, K. Dayanidhi, R. Pieraccini
In this paper we introduce a subjective metric for evaluating the performance of spoken dialog systems: caller experience (CE). CE is a useful metric for tracking the overall performance of a deployed system, as well as for isolating individual problematic calls in which the system underperforms. The proposed CE metric differs from most performance evaluation metrics proposed in the past in that it is (a) a subjective, qualitative rating of the call, and (b) provided by expert external listeners rather than by the callers themselves. We present the results of an experiment in which a set of human experts listened to the same calls three times. The high level of agreement among different listeners, despite the subjective nature of the task, demonstrates the validity of using CE as a standard metric. Finally, an automated rating system using objective measures is shown to perform at the same high level as the human experts. This is an important advance, since it provides a way to reduce the human labor costs associated with producing reliable CE ratings.
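The automated rating system can be pictured as a regressor from objective call measures to the expert rating. The sketch below uses a linear least-squares fit and clips predictions to a 1-to-5 scale; both the feature set and the linear form are assumptions, since the abstract does not specify the model.

```python
import numpy as np

def fit_ce_predictor(X, y):
    """Fit a linear predictor of Caller Experience from objective call
    measures (e.g. counts of no-matches, operator requests, task
    completion). Features and the linear form are assumptions."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_ce(w, x):
    """Predict a rating for one call, clipped to an assumed 1-5 scale."""
    return float(np.clip(np.append(x, 1.0) @ w, 1.0, 5.0))
```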
{"title":"Caller Experience: A method for evaluating dialog systems and its automatic prediction","authors":"Keelan Evanini, P. Hunter, J. Liscombe, David Suendermann-Oeft, K. Dayanidhi, R. Pieraccini","doi":"10.1109/SLT.2008.4777857","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777857","url":null,"abstract":"In this paper we introduce a subjective metric for evaluating the performance of spoken dialog systems, caller experience (CE). CE is a useful metric for tracking the overall performance of a system in deployment, as well as for isolating individual problematic calls in which the system underperforms. The proposed CE metric differs from most performance evaluation metrics proposed in the past in that it is a) a subjective, qualitative rating of the call, and b) provided by expert, external listeners, not the callers themselves. The results of an experiment in which a set of human experts listened to the same calls three times are presented. The fact that these results show a high level of agreement among different listeners, despite the subjective nature of the task, demonstrates the validity of using CE as a standard metric. Finally, an automated rating system using objective measures is shown to perform at the same high level as the humans. This is an important advance, since it provides a way to reduce the human labor costs associated with producing a reliable CE.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127400180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Starting to cook a tutoring dialogue system
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777861
Filipe M. Martins, Joana Paulo Pardal, Luís Franqueira, Pedro Arez, N. Mamede
This paper presents a system that guides the user through cooking a recipe in a spoken dialogue tutoring session. We report our experience in creating the first version of a tutoring dialogue system that helps the user cook a selected dish. With a working framework supporting the creation of the cooking assistant, the main challenge we faced was the change of paradigm: instead of the system being driven by the user, the user is instructed by the system. The result is a system capable of dictating generic content to the user. Moreover, the system can be used in several domains where the goal is not to replace the user but to assist him or her while performing a procedural task.
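The change of paradigm, with the system driving the user through steps, can be illustrated by a bare-bones dictation loop; `listen` and `speak` are hypothetical stand-ins for the ASR and TTS components of the framework.

```python
def tutor(recipe_steps, listen, speak):
    """Dictate one step at a time and wait for the user to move on."""
    for i, step in enumerate(recipe_steps, 1):
        speak(f"Step {i}: {step}")
        while True:
            reply = listen()
            if "repeat" in reply:
                speak(f"Step {i}: {step}")
            elif "next" in reply or "done" in reply:
                break
            else:
                speak("Please say 'repeat' or 'next'.")
    speak("All steps completed. Enjoy your meal!")

# Usage with console stand-ins for speech components:
# tutor(["Chop the onions.", "Fry them until golden."], input, print)
```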
{"title":"Starting to cook a tutoring dialogue system","authors":"Filipe M. Martins, Joana Paulo Pardal, Luís Franqueira, Pedro Arez, N. Mamede","doi":"10.1109/SLT.2008.4777861","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777861","url":null,"abstract":"This paper presents a system that helps you cook a recipe through a spoken dialogue tutoring session. We report our experience while creating the first version of a tutoring dialogue system that helps the user cook a selected dish. Having a working framework to support us with the creation of the cooking assistant, the main challenge we faced was the change of paradigm: instead of the system being driven by the user, the user is instructed by the system. The result is a system capable of dictating generic contents to the user. On top of it, the system can be used in several domains where the goal is not the replacement of the user but providing some assistance while (s)he performs some procedural task.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124710626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}