Shape vector characterization of Vietnamese tones and application to automatic recognition
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034678
Nguyen Quoc-Cuong, Pham Thi Ngoc Yen, E. Castelli
Tone recognition for standard Vietnamese (the Hanoi dialect) is described. A wavelet method is used to extract pitch (F0) from a speech corpus, and from this a feature vector for Vietnamese tone recognition is proposed. Hidden Markov models (HMMs) are then used to recognize the tones. Our results show that tone recognition appears to be independent of the vowel, but accuracy improves when one of the two monotonous tones is used as the pitch reference. Finally, a first attempt at a complete isolated-word recognition engine adapted for Vietnamese is presented.
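The reference-normalization idea lends itself to a small illustration. In the hedged sketch below, an F0 contour is resampled into a fixed-length shape vector expressed in semitones relative to a level reference tone; a nearest-template comparison with invented tone templates stands in for the paper's HMM classifier, and all names and values are assumptions.

```python
# Illustrative sketch only: invented templates, nearest-template matching
# in place of the paper's HMM stage.
import numpy as np

def shape_vector(f0, ref_f0, num_points=10):
    """Resample an F0 contour to a fixed length and express it in
    semitones relative to a level-tone reference frequency."""
    f0 = np.asarray(f0, dtype=float)
    idx = np.linspace(0, len(f0) - 1, num_points)
    resampled = np.interp(idx, np.arange(len(f0)), f0)
    return 12.0 * np.log2(resampled / ref_f0)  # semitone distance from reference

def classify_tone(vec, templates):
    """Pick the tone whose template shape vector is closest in L2 norm."""
    return min(templates, key=lambda t: np.linalg.norm(vec - templates[t]))

# Made-up templates for two of the six Vietnamese tones.
templates = {
    "ngang (level)":   np.zeros(10),
    "huyen (falling)": np.linspace(0.0, -4.0, 10),
}
contour = np.linspace(180.0, 140.0, 25)      # a falling F0 contour in Hz
vec = shape_vector(contour, ref_f0=180.0)
print(classify_tone(vec, templates))          # -> "huyen (falling)"
```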
{"title":"Shape vector characterization of Vietnamese tones and application to automatic recognition","authors":"Nguyen Quoc-Cuong, Pham Thi Ngoc Yen, E. Castelli","doi":"10.1109/ASRU.2001.1034678","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034678","url":null,"abstract":"The tone recognition for Vietnamese standard language (Hanoi dialect) is described. The wavelet method is used to extract the pitch (F0) from a speech signal corpus. Thus, one feature vector for tone recognition of Vietnamese is proposed. Hidden Markov models (HMMs) are then used to recognize the tones. Our results show that tone recognition seems independent of the vowel but presents better accuracy if one of both monotonous tones is used as the pitch reference base. Finally, a first try of a completely isolated word recognition engine, adapted for Vietnamese, is presented.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133449587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
State synchronous modeling of audio-visual information for bi-modal speech recognition
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034671
S. Nakamura, K. Kumatani, S. Tamura
Demand has recently grown for automatic speech recognition (ASR) systems able to operate robustly in acoustically noisy environments. This paper proposes a method to integrate audio and visual information effectively in audio-visual (bi-modal) ASR systems. Such integration inevitably necessitates modeling of the synchronization of the audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on HMM composition. The proposed model can represent state synchronicity, not only within a phoneme, but also between phonemes. Evaluation experiments show that the proposed method improves the recognition accuracy for noisy speech.
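As a rough illustration of the composition idea, the sketch below builds a product HMM over (audio state, visual state) pairs from two toy transition matrices; the max_lag constraint loosely stands in for the limited audio-visual asynchrony the paper models, and the matrices, constraint, and renormalization are assumptions rather than the authors' formulation.

```python
# Schematic sketch of HMM composition; not the authors' code.
import numpy as np

def compose_hmms(A_audio, A_visual, max_lag=1):
    """Product-HMM transition matrix over state pairs (i, j), keeping only
    pairs whose state indices differ by at most max_lag."""
    na, nv = A_audio.shape[0], A_visual.shape[0]
    states = [(i, j) for i in range(na) for j in range(nv) if abs(i - j) <= max_lag]
    A = np.zeros((len(states), len(states)))
    for s, (i, j) in enumerate(states):
        for t, (k, l) in enumerate(states):
            A[s, t] = A_audio[i, k] * A_visual[j, l]
    # Renormalize rows: pruning overly asynchronous pairs removes probability mass.
    A /= A.sum(axis=1, keepdims=True)
    return states, A

A_audio = np.array([[0.7, 0.3], [0.0, 1.0]])   # toy 2-state left-to-right HMMs
A_visual = np.array([[0.6, 0.4], [0.0, 1.0]])
states, A = compose_hmms(A_audio, A_visual)
print(states)        # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(A.round(2))
```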
{"title":"State synchronous modeling of audio-visual information for bi-modal speech recognition","authors":"S. Nakamura, K. Kumatani, S. Tamura","doi":"10.1109/ASRU.2001.1034671","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034671","url":null,"abstract":"There has been a higher demand recently for automatic speech recognition (ASR) systems able to operate robustly in acoustically noisy environments. This paper proposes a method to integrate audio and visual information effectively in audio-visual (bi-modal) ASR systems. Such integration inevitably necessitates modeling of the synchronization of the audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on HMM composition. The proposed model can represent state synchronicity, not only within a phoneme, but also between phonemes. Evaluation experiments show that the proposed method improves the recognition accuracy for noisy speech.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"203 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115305092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brancusi, neo-plasticism, and the art of designing speech-recognition applications
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034577
B. Kotelly
Designing over-the-phone speech-recognition systems requires that designers have a design methodology and philosophy that enable them to understand how to research, design, evaluate, and redesign their applications.
{"title":"Brancusi, neo-plasticism, and the art of designing speech-recognition applications","authors":"B. Kotelly","doi":"10.1109/ASRU.2001.1034577","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034577","url":null,"abstract":"Designing over-the-phone speech-recognition systems requires that designers have a design methodology and philosophy that enables them to understand how to research, design, evaluate and re-design their application.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115687734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recursive noise estimation using iterative stochastic approximation for stereo-based robust speech recognition
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034594
L. Deng, J. Droppo, A. Acero
We present an algorithm for recursive estimation of parameters in a mildly nonlinear model involving incomplete data. In particular, we focus on the time-varying deterministic parameters of additive noise in the nonlinear model. For the nonstationary noise that we encounter in robust speech recognition, different observation data segments correspond to different noise parameter values. Hence, recursive estimation algorithms are more desirable than batch algorithms, since they can be designed to adaptively track the changing noise parameters. One such design, based on the iterative stochastic approximation algorithm in the recursive-EM framework, is described. This new algorithm jointly adapts time-varying noise parameters and the auxiliary parameters introduced to give a linear approximation of the nonlinear model. We present stereo-based robust speech recognition results for the AURORA task, which demonstrate the effectiveness of the new algorithm compared with a more traditional MMSE noise estimation technique under otherwise identical experimental conditions.
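A drastically simplified sketch of recursive noise tracking in this spirit follows. The forgetting-style update, the speech-absence heuristic, and all constants are illustrative assumptions; the paper's actual algorithm (iterative stochastic approximation inside recursive EM, with a linearized nonlinear model) is considerably more involved.

```python
# Toy sketch: frame-by-frame noise-mean tracking with a
# stochastic-approximation-style update. All parameters are invented.
import numpy as np

def track_noise(log_spectra, step=0.1, margin=3.0):
    """Recursively estimate a time-varying noise mean from log-power frames."""
    noise = log_spectra[0].copy()          # initialize from the first frame
    estimates = []
    for frame in log_spectra[1:]:
        # Crude speech-absence weight: frames close to the current noise
        # estimate are treated as noise-dominated.
        p_noise = float(np.mean(frame - noise < margin))
        noise += step * p_noise * (frame - noise)   # recursive update toward the frame
        estimates.append(noise.copy())
    return np.array(estimates)

rng = np.random.default_rng(0)
frames = rng.normal(loc=-2.0, scale=0.5, size=(100, 8))  # synthetic noise frames
frames[40:60] += 6.0                                     # a burst of "speech"
est = track_noise(frames)
print(est[-1].round(2))   # should hover near the true noise level of -2
```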
{"title":"Recursive noise estimation using iterative stochastic approximation for stereo-based robust speech recognition","authors":"L. Deng, J. Droppo, A. Acero","doi":"10.1109/ASRU.2001.1034594","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034594","url":null,"abstract":"We present an algorithm for recursive estimation of parameters in a mildly nonlinear model involving incomplete data. In particular, we focus on the time-varying deterministic parameters of additive noise in the nonlinear model. For the nonstationary noise that we encounter in robust speech recognition, different observation data segments correspond to different noise parameter values. Hence, recursive estimation algorithms are more desirable than batch algorithms, since they can be designed to adaptively track the changing noise parameters. One such design based on the iterative stochastic approximation algorithm in the recursive-EM framework is described. This new algorithm jointly adapts time-varying noise parameters and the auxiliary parameters introduced to give a linear approximation of the nonlinear model. We present stereo-based robust speech recognition results for the AURORA task, which demonstrate the effectiveness of the new algorithm compared with a more traditional, MMSE noise estimation technique under otherwise identical experimental conditions.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"9 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120981650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developing the ETSI Aurora advanced distributed speech recognition front-end and what next?
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034605
David Pearce
The ETSI STQ-Aurora DSR working group is developing the standard for the advanced DSR front-end. One of the main goals of the advanced front-end is improved robustness to noise compared with the existing ETSI DSR standard for the Mel-cepstrum front-end. The purpose of the paper is firstly to inform the wider speech research community about this activity and then to promote discussion of what further needs exist for DSR front-end standards. The scope of the DSR standard is described, together with the set of performance requirements that Aurora has specified for the advanced front-end. An important part of this is the evaluation and characterisation of the performance of candidate front-ends on noisy databases, and an overview of these is given. As the competition to select the best proposal draws to a close (submission deadline 28th November 2001), an interesting question is "What next?".
{"title":"Developing the ETSI Aurora advanced distributed speech recognition front-end and what next?","authors":"David Pearce","doi":"10.1109/ASRU.2001.1034605","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034605","url":null,"abstract":"The ETSI STQ-Aurora DSR working group is developing the standard for the advanced DSR front-end. One of the main goals of the advanced front-end is improved robustness to noise compared to the existing ETSI DSR standard for the Mel-cepstrum front-end. The purpose of the paper is firstly to inform the wider speech research community about this activity and then to promote discussion on what further needs there are for DSR front-end standards. The scope of the DSR standard is described and the set of performance requirements that Aurora has specified for the advanced front-end. An important part of this is the evaluation and characterisation of the performance of candidate front-ends on noisy databases, and an overview of these is given. As the competition to select the best proposal draws to a close (submission deadline 28/sup th/ November 2001) an interesting question is \"What next?\".","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"53 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116385495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An examination of three classes of ASR dialogue systems: PC-based dictation, in-car systems and automated directory assistance
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034683
M. Hunt
Three classes of practical speech recognition dialogue systems are considered, starting with PC-based systems, specifically dictation systems. Although such systems have become very effective, they have not achieved mainstream use. Some reasons for this disappointing outcome are proposed. Speech recognition is now appearing in production cars. It is argued that the two most attractive in-car applications are for navigation systems and for dialing-by-name. The latter may be more suited to equipment that can be detached from the car and connected to a PC. After considering telephone applications in general, the importance of automated DA (directory assistance - also called directory enquiries or DQ in some countries) is established and its particular challenges are discussed. Among these are the size and dynamic nature of the databases accessed, and the variations produced by callers in naming a commercial/administrative entity whose number they are seeking. The advantages of a bottom-up phonetic speech recognition technique for automated DA are described. It is concluded that the combination of this technique and automatic methods for handling name variation makes automated DA, including access to business listings, a practical proposition.
{"title":"An examination of three classes of ASR dialogue systems: PC-based dictation, in-car systems and automated directory assistance","authors":"M. Hunt","doi":"10.1109/ASRU.2001.1034683","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034683","url":null,"abstract":"Three classes of practical speech recognition dialogue systems are considered, starting with PC-based systems, specifically dictation systems. Although such systems have become very effective, they have not achieved mainstream use. Some reasons for this disappointing outcome are proposed. Speech recognition is now appearing in production cars. It is argued that the two most attractive in-car applications are for navigation systems and for dialing-by-name. The latter may be more suited to equipment that can be detached from the car and connected to a PC. After considering telephone applications in general, the importance of automated DA (directory assistance - also called directory enquiries or DQ in some countries) is established and its particular challenges are discussed. Among these are the size and dynamic nature of the databases accessed, and the variations produced by callers in naming a commercial/administrative entity whose number they are seeking. The advantages of a bottom-up phonetic speech recognition technique for automated DA are described. It is concluded that the combination of this technique and automatic methods for handling name variation makes automated DA, including access to business listings, a practical proposition.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116704556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic modeling for dialog systems in a pattern recognition framework
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034643
Kuansan Wang
In this paper, we describe a multimodal dialog system based on the pattern recognition framework that has been successfully applied to automatic speech recognition. We treat the dialog problem as recognizing the optimal action given the user's input and context. Analogous to the acoustic, pronunciation, and language models of speech recognition, the dialog system in this framework has language, semantic, and behavior models to take into account when searching for the best result. The paper focuses on our approaches to semantic modeling, describing how the semantic lexicon and domain knowledge are derived and integrated. We show that, once semantic abstraction is introduced, multimodal integration can be achieved using the reference resolution algorithm developed for natural language understanding. Several applications developed to test various aspects of the proposed framework are also described.
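The speech recognition analogy suggests an argmax decision rule over candidate actions. The toy sketch below assumes a hypothetical factorization into an input-scoring (semantic) term and a context-conditioned behavior term; the model tables are fabricated for illustration and are not the paper's models.

```python
# Toy sketch of an argmax over actions; all probabilities are invented.
import math

def best_action(user_input, context, semantic_model, behavior_model):
    """argmax over actions of log P(input | action) + log P(action | context)."""
    actions = behavior_model[context].keys()
    return max(
        actions,
        key=lambda a: math.log(semantic_model[a].get(user_input, 1e-9))
                      + math.log(behavior_model[context][a]),
    )

# Fabricated models: two candidate actions in a travel-booking context.
semantic_model = {
    "book_flight": {"i need a ticket to boston": 0.6},
    "show_help":   {"i need a ticket to boston": 0.01},
}
behavior_model = {"start": {"book_flight": 0.5, "show_help": 0.5}}
print(best_action("i need a ticket to boston", "start", semantic_model, behavior_model))
# -> book_flight
```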
{"title":"Semantic modeling for dialog systems in a pattern recognition framework","authors":"Kuansan Wang","doi":"10.1109/ASRU.2001.1034643","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034643","url":null,"abstract":"In this paper, we describe a multimodal dialog system based on the pattern recognition framework that has been successfully applied to automatic speech recognition. We treat the dialog problem as to recognize the optimal action based on the user's input and context. Analogous to the acoustic, pronunciation, and language models for speech recognition, the dialog system in this framework has language, semantic, and behavior models to take into account when it searches for the best result. The paper focuses on our approaches in semantic modeling, describing how semantic lexicon and domain knowledge are derived and integrated. We show that, once semantic abstraction is introduced, multimodal integration can be achieved using the reference resolution algorithm developed for natural language understanding. Several applications developed to test various aspects of the proposed framework are also described.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125726247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A one-pass decoder based on polymorphic linguistic context assignment
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034625
H. Soltau, Florian Metze, C. Fugen, A. Waibel
In this study, we examine how fast decoding of conversational speech with large vocabularies profits from efficient use of linguistic information, i.e. language models and grammars. Based on a re-entrant single pronunciation prefix tree, we use the concept of linguistic context polymorphism to allow an early incorporation of language model information. This approach allows us to use all available language model information in a one-pass decoder, using the same engine to decode with statistical n-gram language models as well as context free grammars or re-scoring of lattices in an efficient way. We compare this approach to our previous decoder, which needed three passes to incorporate all available information. The results on a very large vocabulary task show that the search can be speeded up by almost a factor of three, without introducing additional search errors.
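One ingredient of early language-model incorporation on a shared pronunciation prefix tree can be sketched as LM look-ahead: for a given LM context, each tree node is annotated with the best LM probability of any word reachable below it, so that per-context ("polymorphic") copies of a node can carry LM scores before a word end is reached. The toy lexicon and bigram scores below are invented, and the sketch omits the re-entrant bookkeeping of the real decoder.

```python
# Toy LM look-ahead over a pronunciation prefix tree; lexicon and
# bigram scores are invented for illustration.
def build_trie(lexicon):
    """Prefix tree over phone sequences; word identities stored at word-end nodes."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["#word"] = word
    return root

def lookahead(node, lm_context, lm):
    """Annotate each node with the max LM probability of any word below it."""
    best = lm.get((lm_context, node["#word"]), 1e-9) if "#word" in node else 0.0
    for phone, child in node.items():
        if phone != "#word":
            best = max(best, lookahead(child, lm_context, lm))
    node["#la"] = best
    return best

lexicon = {"sea": ["s", "iy"], "seed": ["s", "iy", "d"], "tea": ["t", "iy"]}
lm = {("the", "sea"): 0.3, ("the", "seed"): 0.05, ("the", "tea"): 0.2}
root = build_trie(lexicon)
lookahead(root, "the", lm)
print(root["s"]["#la"])   # 0.3: best word under the "s" branch given context "the"
```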
{"title":"A one-pass decoder based on polymorphic linguistic context assignment","authors":"H. Soltau, Florian Metze, C. Fugen, A. Waibel","doi":"10.1109/ASRU.2001.1034625","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034625","url":null,"abstract":"In this study, we examine how fast decoding of conversational speech with large vocabularies profits from efficient use of linguistic information, i.e. language models and grammars. Based on a re-entrant single pronunciation prefix tree, we use the concept of linguistic context polymorphism to allow an early incorporation of language model information. This approach allows us to use all available language model information in a one-pass decoder, using the same engine to decode with statistical n-gram language models as well as context free grammars or re-scoring of lattices in an efficient way. We compare this approach to our previous decoder, which needed three passes to incorporate all available information. The results on a very large vocabulary task show that the search can be speeded up by almost a factor of three, without introducing additional search errors.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124587184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust analysis of spoken input combining statistical and knowledge-based information sources
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034658
R. Cattoni, Marcello Federico, A. Lavie
The paper is concerned with the analysis of automatic transcription of spoken input into an interlingua formalism for a speech-to-speech machine translation system. This process is based on two sub-tasks: (1) the recognition of the domain action (a speech act and a sequence of concepts); (2) the extraction of arguments consisting of feature-value information. Statistical models are used for the former, while a knowledge-based approach is employed for the latter. The paper proposes an algorithm that improves the analysis in terms of robustness and performance; it combines the scores of the statistical models with the extracted arguments, taking into account the well-formedness constraints defined by the interlingua formalism.
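The combination step can be caricatured as re-ranking statistically scored domain-action candidates by whether the arguments extracted for them satisfy the interlingua's well-formedness constraints. The candidate actions, scores, and constraint table below are invented for illustration.

```python
# Toy re-ranking sketch; candidates, scores, and constraints are invented.
def rerank(candidates, extracted_args, required_args):
    """Prefer candidates whose required interlingua arguments were all found."""
    def combined(cand):
        action, score = cand
        missing = set(required_args.get(action, [])) - set(extracted_args)
        return (len(missing) == 0, score)   # well-formed first, then by score
    return max(candidates, key=combined)

candidates = [("give-information+price", 0.55), ("request-action+reserve", 0.45)]
required_args = {
    "give-information+price": ["price", "object"],
    "request-action+reserve": ["object"],
}
extracted_args = ["object"]   # only an object argument was found in the utterance
print(rerank(candidates, extracted_args, required_args))
# -> ("request-action+reserve", 0.45): lower score, but well-formed
```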
{"title":"Robust analysis of spoken input combining statistical and knowledge-based information sources","authors":"R. Cattoni, Marcello Federico, A. Lavie","doi":"10.1109/ASRU.2001.1034658","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034658","url":null,"abstract":"The paper is concerned with the analysis of automatic transcription of spoken input into an interlingua formalism for a speech-to-speech machine translation system. This process is based on two sub-tasks: (1) the recognition of the domain action (a speech act and a sequence of concepts); (2) the extraction of arguments consisting of feature-value information. Statistical models are used for the former, while a knowledge-based approach is employed for the latter. The paper proposes an algorithm that improves the analysis in terms of robustness and performance; it combines the scores of the statistical models with the extracted arguments, taking into account the well-formedness constraints defined by the interlingua formalism.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132428357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Some experiments on the use of one-channel noise reduction techniques with the Italian SpeechDat Car database
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034607
M. Matassoni, G. Mian, M. Omologo, A. Santarelli, P. Svaizer
The use of noise reduction techniques for hands-free speech recognition in a car environment is investigated. A set of experiments was carried out using different speech enhancement algorithms based on noise estimation. In particular, linear spectral subtraction and MMSE estimators are considered with various parameter settings. Experiments were conducted on connected and isolated digits extracted from the Italian version of the SpeechDat Car database. Recognition rates do not always agree with the acoustically perceived quality of the noise reduction. The best performance is obtained by spectral subtraction with a suitable choice of the oversubtraction factor and a quantile noise estimator: digit recognition accuracy rises from the 94.4% baseline to 96.2%, a relative error-rate reduction of more than 30% (the error rate falls from 5.6% to 3.8%).
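Under assumed parameter values, the winning configuration might be sketched as follows: a quantile-based noise estimate per frequency bin (low quantiles of power over time approximate the noise floor without explicit voice-activity detection) feeding power-domain spectral subtraction with an oversubtraction factor and a spectral floor.

```python
# Compact sketch; alpha, beta, and the quantile are illustrative, not the paper's.
import numpy as np

def quantile_noise(power_spectra, q=0.25):
    """Per-bin noise estimate: the q-quantile of power over all frames."""
    return np.quantile(power_spectra, q, axis=0)

def spectral_subtraction(power_spectra, noise, alpha=2.0, beta=0.01):
    """Subtract alpha * noise (oversubtraction), flooring at beta * noise."""
    cleaned = power_spectra - alpha * noise
    return np.maximum(cleaned, beta * noise)

rng = np.random.default_rng(1)
frames = rng.exponential(scale=1.0, size=(200, 16))   # synthetic noisy power frames
noise = quantile_noise(frames)
enhanced = spectral_subtraction(frames, noise)
print(noise.round(2))
```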
{"title":"Some experiments on the use of one-channel noise reduction techniques with the Italian SpeechDat Car database","authors":"M. Matassoni, G. Mian, M. Omologo, A. Santarelli, P. Svaizer","doi":"10.1109/ASRU.2001.1034607","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034607","url":null,"abstract":"The use of noise reduction techniques for hands-free speech recognition in a car environment is investigated. A set of experiments was carried out using different speech enhancement algorithms based on noise estimation. In particular, linear spectral subtraction and MMSE estimators are considered with various parameter settings. Experiments were conducted on connected and isolated digits, extracted from the Italian version of the SpeechDat Car database. Recognition rates do not agree with acoustically perceived quality of noise reduction. As a result, the best performance is obtained by spectral subtraction with a suitable choice of the oversubtraction factor and a quantile noise estimator. It provides more than 30% relative performance improvement, from 94.4% of the baseline to 96.2% digit recognition accuracy.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132642516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}