Language modeling for multi-domain speech-driven text retrieval
K. Itou, Atsushi Fujii, Tetsuya Ishikawa
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034653
We report experimental results on speech-driven text retrieval, which facilitates retrieving information in multiple domains with spoken queries. Since users speak queries whose content is related to a target collection, we build the language models used for speech recognition from the target collection itself, so as to improve both recognition and retrieval accuracy. Experiments combining existing test collections with dictated queries showed the effectiveness of our method.
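As a rough illustration of the idea in this abstract, the sketch below estimates a bigram language model directly from the documents of a target retrieval collection; the toy corpus, whitespace tokenization, and add-one smoothing are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch: estimate a bigram LM from a target document collection.
# The toy corpus, tokenization, and smoothing are assumptions for illustration.
from collections import Counter


def train_bigram_lm(documents):
    """Return unigram and bigram counts over the collection."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = ["<s>"] + doc.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams


def bigram_prob(w_prev, w, unigrams, bigrams):
    """Add-one smoothed P(w | w_prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)


collection = ["speech driven text retrieval", "retrieval of spoken documents"]
uni, bi = train_bigram_lm(collection)
print(bigram_prob("text", "retrieval", uni, bi))
```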
{"title":"Language modeling for multi-domain speech-driven text retrieval","authors":"K. Itou, Atsushi Fujii, Tetsuya Ishikawa","doi":"10.1109/ASRU.2001.1034653","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034653","url":null,"abstract":"We report experimental results associated with speech-driven text retrieval, which facilitates retrieving information in multiple domains with spoken queries. Since users speak contents related to a target collection, we produce language models used for speech recognition based on the target collection, so as to improve both the recognition and retrieval accuracy. Experiments using existing test collections combined with dictated queries showed the effectiveness of our method.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130231520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Very large vocabulary proper name recognition for directory assistance
F. Béchet, R. de Mori, G. Subsol
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034627
This paper deals with the difficult task of recognizing a very large vocabulary of proper names in a directory assistance application. After a review of related work, it introduces a methodology for rescoring the N-best hypotheses generated by a first recognition pass. Initial experiments give encouraging results, and several topics for future research are presented.
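A minimal sketch of generic N-best rescoring, not the paper's particular criterion: re-rank the first-pass hypotheses by combining their recognizer scores with an additional knowledge source; the `phonetic_score` stand-in and the interpolation weight are assumptions.

```python
# Illustrative N-best rescoring: combine the first-pass score with an extra
# score and re-rank. The extra scorer and weight are illustrative assumptions.

def rescore_nbest(nbest, extra_score, weight=0.5):
    """nbest: list of (hypothesis, first_pass_score); higher is better."""
    rescored = [(hyp, score + weight * extra_score(hyp)) for hyp, score in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def phonetic_score(hypothesis):
    # Hypothetical knowledge source, e.g. a match score against directory entries.
    return -abs(len(hypothesis) - 10) / 10.0


nbest = [("jean dupont", -12.3), ("jean dupond", -12.5), ("gene du pont", -13.1)]
print(rescore_nbest(nbest, phonetic_score)[0])
```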
{"title":"Very large vocabulary proper name recognition for directory assistance","authors":"F. Béchet, R. de Mori, G. Subsol","doi":"10.1109/ASRU.2001.1034627","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034627","url":null,"abstract":"This paper deals with the difficult task of recognition of a large vocabulary of proper names in a directory assistance application. After a presentation of the related work, it introduces a methodology for rescoring the N-best hypotheses generated by a first step recognition. First experiments give encouraging results and several topics for future research are presented.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116832540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigating stochastic speech understanding
H. Bonneau-Maynard, F. Lefèvre
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034637
The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However, corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. This work investigates the influence of the training corpus size on the performance of the understanding module. The use of automatically annotated data is also investigated as a means to increase the corpus size at a very low cost. First, a stochastic speech understanding model developed using data collected with the LIMSI ARISE dialog system is presented. Its performance is shown to be comparable to that of the rule-based caseframe grammar currently used in the system. In a second step, two ways of reducing the development cost are pursued: (1) reducing the amount of manually annotated data used to train the stochastic models and (2) using automatically annotated data in the training process.
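One common way to use automatically annotated data, sketched below as an illustration rather than the paper's exact procedure: train on the manually annotated seed corpus, let the resulting model annotate raw utterances, then retrain on the union. The model interface here is a hypothetical stand-in.

```python
# Illustrative bootstrapping loop for augmenting a small manually annotated
# corpus with automatic annotations. `train_model` and `model.annotate` are
# hypothetical stand-ins for the stochastic understanding model.

def bootstrap_training(manual_corpus, unannotated_utterances, train_model):
    # 1. Train an initial model on the manually annotated seed corpus.
    model = train_model(manual_corpus)
    # 2. Let that model annotate the raw utterances automatically.
    auto_corpus = [(utt, model.annotate(utt)) for utt in unannotated_utterances]
    # 3. Retrain on the union of manual and automatic annotations.
    return train_model(manual_corpus + auto_corpus)
```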
{"title":"Investigating stochastic speech understanding","authors":"H. Bonneau-Maynard, F. Lefèvre","doi":"10.1109/ASRU.2001.1034637","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034637","url":null,"abstract":"The need for human expertise in the development of a speech understanding system can be greatly reduced by the use of stochastic techniques. However corpus-based techniques require the annotation of large amounts of training data. Manual semantic annotation of such corpora is tedious, expensive, and subject to inconsistencies. This work investigates the influence of the training corpus size on the performance of the understanding module. The use of automatically annotated data is also investigated as a means to increase the corpus size at a very low cost. First, a stochastic speech understanding model developed using data collected with the LIMSI ARISE dialog system is presented. Its performance is shown to be comparable to that of the rule-based caseframe grammar currently used in the system. In a second step, two ways of reducing the development cost are pursued: (1) reducing of the amount of manually annotated data used to train the stochastic models and (2) using automatically annotated data in the training process.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114767318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker-trained recognition using allophonic enrollment models
V. Vanhoucke, M. Hochberg, C. Leggetter
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034589
We introduce a method for performing speaker-trained recognition based on context-dependent allophone models from a large-vocabulary, speaker-independent recognition system. A set of speaker-enrollment templates is selected from the context-dependent allophone models. These templates are used to build representations of the speaker-enrolled utterances. The advantages of this approach include improved performance and portability of the enrollments across different acoustic models. We describe the approach used to select the enrollment templates and how to apply them to speaker-trained recognition. The approach has been evaluated on an over-the-telephone, voice-activated dialing task and shows significant performance improvements over techniques based on context-independent phone models or general acoustic model templates. In addition, the portability of enrollments from one model set to another is shown to result in almost no performance degradation.
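A heavily simplified sketch of the enrollment idea, under the assumption that each enrolled phrase is represented by the allophone-unit sequence chosen when decoding its enrollment utterance; `decode_allophones` is a hypothetical stand-in for the speaker-independent recognizer, and the sequence-similarity match is an illustrative choice, not the paper's scoring.

```python
# Illustrative sketch: store a decoded allophone-unit sequence per enrolled
# phrase, then recognize by matching decoded test sequences to the templates.
from difflib import SequenceMatcher


def enroll(phrases, decode_allophones):
    """Map each phrase label to its allophone-unit template."""
    return {label: decode_allophones(audio) for label, audio in phrases}


def recognize(audio, templates, decode_allophones):
    """Return the enrolled label whose template best matches the decoding."""
    test_units = decode_allophones(audio)

    def similarity(units):
        return SequenceMatcher(None, test_units, units).ratio()

    return max(templates, key=lambda label: similarity(templates[label]))
```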
{"title":"Speaker-trained recognition using allophonic enrollment models","authors":"V. Yanhoucke, M. Hochberg, C. Leggetter","doi":"10.1109/ASRU.2001.1034589","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034589","url":null,"abstract":"We introduce a method for performing speaker-trained recognition based on context-dependent allophone models from a large-vocabulary, speaker-independent recognition system. A set of speaker-enrollment templates is selected from the context-dependent allophone models. These templates are used to build representations of the speaker-enrolled utterances. The advantages of this approach include improved performance and portability of the enrollments across different acoustic models. We describe the approach used to select the enrollment templates and how to apply them to speaker-trained recognition. The approach has been evaluated on an over-the-telephone, voice-activated dialing task and shows significant performance improvements over techniques based on context-independent phone models or general acoustic model templates. In addition, the portability of enrollments from one model set to another is shown to result in almost no performance degradation.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128736530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High performance telephone bandwidth speaker independent continuous digit recognition
P. Cosi, J.-P. Hosom, A. Valente
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034670
The development of a high-performance, telephone-bandwidth, speaker-independent connected-digit recognizer for Italian is described. The CSLU Speech Toolkit was used to develop and implement the hybrid ANN/HMM system, which is trained on context-dependent categories to account for coarticulatory variation. Various front-end processing schemes and system architectures were compared; with the best features (MFCC with CMS + Δ) and the best network (a 4-layer fully connected feed-forward network), the system achieved 98.92% word recognition accuracy and 92.62% sentence recognition accuracy on a test set of the FIELD continuous-digit recognition task.
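A minimal sketch of the feature post-processing named above (cepstral mean subtraction and delta coefficients) applied to an MFCC matrix; the MFCC extraction itself and the 2-frame delta window are assumptions for illustration.

```python
# Illustrative CMS + delta post-processing of MFCC features, shape
# (num_frames, num_coeffs). The delta window size is an assumption.
import numpy as np


def cms(mfcc):
    """Subtract the per-utterance cepstral mean from every frame."""
    return mfcc - mfcc.mean(axis=0, keepdims=True)


def deltas(features, window=2):
    """Regression-based delta coefficients over +/- `window` frames."""
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (padded[window + k:window + k + len(features)] -
                   padded[window - k:window - k + len(features)])
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))


mfcc = np.random.randn(100, 13)              # stand-in for real MFCC frames
features = np.hstack([cms(mfcc), deltas(cms(mfcc))])
print(features.shape)                        # (100, 26)
```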
{"title":"High performance telephone bandwidth speaker independent continuous digit recognition","authors":"P. Cosi, J.-P. Hosoma, A. Valente","doi":"10.1109/ASRU.2001.1034670","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034670","url":null,"abstract":"The development of a high-performance telephone-bandwidth speaker independent connected digit recognizer for Italian is described. The CSLU Speech Toolkit was used to develop and implement the hybrid ANN/HMM system, which is trained on context-dependent categories to account for coarticulatory variation. Various front-end processing and system architectures were compared and, when the best features (MFCC with CMS + /spl Delta/) and network (4-layer fully connected feed-forward network) were considered, there was a 98.92% word recognition accuracy and a 92.62% sentence recognition accuracy on a test set of the FIELD continuous digits recognition task.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116596695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pseudo 2-dimensional hidden Markov models in speech recognition
S. Werner, G. Rigoll
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034679
In this paper, the use of pseudo 2-dimensional hidden Markov models for speech recognition is discussed. This method, adopted from image processing, is intended to better model the time-frequency structure of speech signals. The emission probability of each state of a standard HMM is computed by an embedded HMM. If a temporal sequence of spectral vectors is viewed as a spectrogram, this leads to a 2-dimensional warping of the spectrogram. This additional warping of the frequency axis could be useful for speaker-independent recognition and can be considered similar to vocal tract normalization. The effects of this paradigm are investigated in this paper using the TI-Digits database.
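A small sketch of the emission computation described above: the outer HMM state observes a whole spectral frame, whose likelihood is the forward probability of a small embedded HMM run along the frequency axis. The Gaussian output densities and transition values below are illustrative assumptions.

```python
# Illustrative emission score for a pseudo-2D HMM: forward algorithm of an
# embedded (frequency-axis) HMM over the bins of one spectral frame.
import numpy as np


def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def embedded_forward_logprob(frame, log_trans, means, variances, log_init):
    """Log-likelihood of one spectral frame under an embedded HMM."""
    log_alpha = log_init + log_gauss(frame[0], means, variances)
    for x in frame[1:]:
        log_alpha = (np.logaddexp.reduce(log_alpha[:, None] + log_trans, axis=0)
                     + log_gauss(x, means, variances))
    return np.logaddexp.reduce(log_alpha)


# Toy example: a 3-state left-to-right embedded HMM scoring an 8-bin frame.
frame = np.random.randn(8)
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-12)
print(embedded_forward_logprob(frame, log_trans,
                               means=np.zeros(3), variances=np.ones(3),
                               log_init=np.log([1.0, 1e-12, 1e-12])))
```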
{"title":"Pseudo 2-dimensional hidden Markov models in speech recognition","authors":"S. Werner, G. Rigoll","doi":"10.1109/ASRU.2001.1034679","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034679","url":null,"abstract":"In this paper, the usage of pseudo 2-dimensional hidden Markov models for speech recognition is discussed. This image processing method should better model the time-frequency structure in speech signals. The method calculates the emission probability of a standard HMM by embedded HMM for each state. If a temporal sequence of spectral vectors is imagined as a spectrogram, this leads to a 2-dimensional warping of the spectrogram. This additional warping of the frequency axis could be useful for speaker-independent recognition and can be considered to be similar to a vocal tract normalization. The effects of this paradigm are investigated in this paper using the TI-Digits database.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127667310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task-specific adaptation of speech recognition models
A. Sankar, Ashvin Kannan, B. Shahshahani, E. Jackson
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034677
Most published adaptation research focuses on speaker adaptation and on adaptation to noisy channels and background environments. We study acoustic, grammar, and combined acoustic and grammar adaptation for creating task-specific recognition models. Comprehensive experimental results are presented using data from natural language quotes and a trading application. The results show that task adaptation gives substantial improvements in both utterance understanding accuracy and recognition speed.
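As one common form of grammar adaptation, sketched purely for illustration and not necessarily the scheme used in the paper: interpolate a small task-specific n-gram model with a general background model. The dict-based models and interpolation weight are assumptions.

```python
# Illustrative grammar adaptation by linear interpolation of a task-specific
# n-gram model with a background model. Weights and models are toy values.

def interpolate_lm(task_lm, background_lm, lam=0.7):
    """Return P_adapted(w | h) = lam * P_task + (1 - lam) * P_background."""
    def prob(history, word):
        return (lam * task_lm.get((history, word), 0.0)
                + (1 - lam) * background_lm.get((history, word), 0.0))
    return prob


task_lm = {(("buy",), "shares"): 0.2}
background_lm = {(("buy",), "shares"): 0.01, (("buy",), "milk"): 0.05}
adapted = interpolate_lm(task_lm, background_lm)
print(adapted(("buy",), "shares"), adapted(("buy",), "milk"))
```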
{"title":"Task-specific adaptation of speech recognition models","authors":"A. Sankar, Ashvin Kannan, B. Shahshahani, E. Jackson","doi":"10.1109/ASRU.2001.1034677","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034677","url":null,"abstract":"Most published adaptation research focuses on speaker adaptation, and on adaptation for noisy channels and background environments. We study acoustic, grammar, and combined acoustic and grammar adaptation for creating task-specific recognition models. Comprehensive experimental results are presented using data from natural language quotes and a trading application. The results show that task adaptation gives substantial improvements in both utterance understanding accuracy, and recognition speed.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117051335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An online model adaptation method for compensating speech models for noise in continuous speech recognition
R. Lee, E. Choi
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034609
This paper presents a method for online model adaptation based on the parallel model combination (PMC) method. The proposed method makes use of Gaussian model clustering to reduce the computational load required by PMC. This model clustering, in combination with a set of derived transformation equations, provides a potential framework for online model adaptation in noisy speech recognition. Compared with the standard PMC method, the proposed method reduces the computation required for adaptation by about 45%, with only a slight degradation of the accuracy improvements, which average 18% for a connected-digit task and 9% for a large-vocabulary Mandarin task.
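For context, a minimal sketch of the standard PMC log-normal approximation for combining a clean-speech Gaussian with a noise Gaussian in the log-spectral domain; the toy parameters are assumptions, and the paper's clustering speedup would apply such an update only to cluster representatives rather than to every Gaussian.

```python
# Illustrative PMC log-normal approximation for diagonal log-spectral
# Gaussians: map to the linear domain, add speech and noise, map back.
import numpy as np


def pmc_lognormal(mu_speech, var_speech, mu_noise, var_noise):
    """Combine log-spectral Gaussians of clean speech and additive noise."""
    # Log-normal moments in the linear spectral domain.
    mu_s_lin = np.exp(mu_speech + var_speech / 2)
    mu_n_lin = np.exp(mu_noise + var_noise / 2)
    var_s_lin = mu_s_lin ** 2 * (np.exp(var_speech) - 1)
    var_n_lin = mu_n_lin ** 2 * (np.exp(var_noise) - 1)
    # Speech and noise are assumed additive and independent in this domain.
    mu_lin = mu_s_lin + mu_n_lin
    var_lin = var_s_lin + var_n_lin
    # Map the combined moments back to the log-spectral domain.
    var_noisy = np.log(var_lin / mu_lin ** 2 + 1)
    mu_noisy = np.log(mu_lin) - var_noisy / 2
    return mu_noisy, var_noisy


mu, var = pmc_lognormal(np.array([1.0, 2.0]), np.array([0.3, 0.2]),
                        np.array([0.5, 0.4]), np.array([0.1, 0.1]))
print(mu, var)
```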
{"title":"An online model adaptation method for compensating speech models for noise in continuous speech recognition","authors":"R. Lee, E. Choi","doi":"10.1109/ASRU.2001.1034609","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034609","url":null,"abstract":"This paper presents a method for online model adaptation based on the parallel model combination (PMC) method. The proposed method makes use of the concept of Gaussian model clustering to reduce the computation load required by PMC. This model clustering, in combination with a set of derived transformation equations, provide a potential framework for online model adaptation in noisy speech recognition. The proposed method reduces the computation in adaptation by about 45% with only a slight degradation in improvements of an average 18% for a connected digit task and 9% for a large vocabulary Mandarin task when compared with standard PMC method.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"192 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117113175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An open concept metric for assessing dialog system complexity
T. M. DuBois, Alexander I. Rudnicky
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034638
Techniques for assessing dialog system performance commonly focus on characteristics of the interaction, using metrics such as completion, satisfaction, or time on task. However, such metrics are not always capable of differentiating systems that operate on fundamentally different principles, particularly when tested on tasks that focus on common-denominator capabilities. We introduce a new metric, the open concept count, and show how it can be used to capture useful properties of a dialog system.
{"title":"An open concept metric for assessing dialog system complexity","authors":"T. M. DuBois, Alexander I. Rudnicky","doi":"10.1109/ASRU.2001.1034638","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034638","url":null,"abstract":"Techniques for assessing dialog system performance commonly focus on characteristics of the interaction, using metrics such as completion, satisfaction or time on task. However, such metrics are not always capable of differentiating systems that operate on fundamentally different principles, particularly when tested on tasks that focus on common-denominator capabilities. We introduce a new metric, the open concept count, and show how it can be used to capture useful system properties of a dialog system.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130666658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Markovian combination of language and prosodic models for better speech understanding and recognition
A. Stolcke, Elizabeth Shriberg
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034615
Summary form only given. Traditionally, "language" models capture only the word sequences of a language. A crucial component of spoken language, however, is its prosody, i.e., its rhythmic and melodic properties. This paper summarizes recent work on integrated, computationally efficient modeling of word sequences and prosodic properties of speech for a variety of speech recognition and understanding tasks, such as dialog act tagging, disfluency detection, and segmentation into sentences and topics. In each case it turns out that hidden Markov representations of the underlying structures and associated observations arise naturally and allow existing speech recognizers to be combined with separately trained prosodic classifiers. The same HMM-based models can be used in two modes: to recover hidden structure (such as sentence boundaries), or to evaluate speech recognition hypotheses, thereby integrating prosody into the recognition process.
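A simplified sketch of the hidden-event decoding described above: at each word boundary a hidden label (sentence boundary or not) is recovered by Viterbi search, combining a language-model transition score with a prosodic classifier score. The toy probabilities stand in for the separately trained models.

```python
# Illustrative Viterbi decoding of hidden boundary events, combining a
# hidden-event LM score with a prosodic score at each word boundary.
import numpy as np

STATES = ("no_boundary", "boundary")


def viterbi_events(lm_logprob, prosody_logprob, num_boundaries):
    """Return the most likely label sequence over `num_boundaries` positions."""
    scores = {s: prosody_logprob(0, s) for s in STATES}
    back = []
    for t in range(1, num_boundaries):
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + lm_logprob(p, s))
            new_scores[s] = (scores[best_prev] + lm_logprob(best_prev, s)
                             + prosody_logprob(t, s))
            pointers[s] = best_prev
        scores, back = new_scores, back + [pointers]
    # Trace back the best path.
    label = max(STATES, key=scores.get)
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return list(reversed(path))


lm = lambda prev, cur: np.log(0.8 if cur == "no_boundary" else 0.2)
prosody = lambda t, s: np.log(0.9 if (t == 2) == (s == "boundary") else 0.1)
print(viterbi_events(lm, prosody, num_boundaries=4))
```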
{"title":"Markovian combination of language and prosodic models for better speech understanding and recognition","authors":"A. Stolcke, Elizabeth Shriberg","doi":"10.1109/ASRU.2001.1034615","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034615","url":null,"abstract":"Summary form only given. Traditionally, \"language\" models capture only the word sequences of a language. A crucial component of spoken language, however is its prosody, i.e., rhythmic and melodic properties. This paper summarizes recent work on integrated, computationally efficient modeling of word sequences and prosodic properties of speech, for a variety of speech recognition and understanding tasks, such as dialog act tagging, disfluency detection, and segmentation into sentences and topics. In each case it turns out that hidden Markov representations of the underlying structures and associated observations arise naturally, and allow existing speech recognizers to be combined with separately trained prosodic classifiers. The same HMM-based models can be used in two modes: to recover hidden structure (such as sentence boundaries), or to evaluate speech recognition hypotheses, thereby integrating prosody into the recognition process.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115924796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}