Recognition experiments with the SpeechDat-Car Aurora Spanish database using 8 kHz- and 16 kHz-sampled signals
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034606
C. Nadeu, M. Tolos
Like the other SpeechDat-Car databases, the Spanish one has been collected at a 16 kHz sampling frequency, with several microphone positions and environmental noise conditions. We aim to clarify whether processing the 16 kHz-sampled signals, instead of the usual 8 kHz-sampled ones, brings any advantage in recognition performance. Recognition tests have been carried out within the Aurora experimental framework, which includes signals from both a close-talking microphone and a distant microphone. Our preliminary results indicate that the increased bandwidth can yield a performance improvement in the noisy car environment.
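For illustration only (not part of the paper): a minimal sketch of deriving 8 kHz-sampled signals from 16 kHz-sampled ones so that both bandwidths can be fed to the same recognition front-end; the synthetic tone and the use of SciPy are assumptions.

```python
# Sketch: derive an 8 kHz version of a 16 kHz signal for the bandwidth comparison.
# A synthetic tone stands in for a SpeechDat-Car recording.
import numpy as np
from scipy.signal import resample_poly

fs_high = 16000
t = np.arange(fs_high) / fs_high          # one second of signal
x16 = np.sin(2 * np.pi * 440.0 * t)       # stand-in for a 16 kHz-sampled utterance
x8 = resample_poly(x16, up=1, down=2)     # low-pass filter and decimate to 8 kHz
print(len(x16), len(x8))                  # 16000 8000
```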
{"title":"Recognition experiments with the SpeechDat-Car Aurora Spanish database using 8 kHz- and 16 kHz-sampled signals","authors":"C. Nadeu, M. Tolos","doi":"10.1109/ASRU.2001.1034606","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034606","url":null,"abstract":"Like the other SpeechDat-Car databases, the Spanish one has been collected using a 16 kHz sampling frequency, and several microphone positions and environmental noises. We aim at clarifying whether there is any advantage in terms of recognition performance from processing the 16 kHz-sampled signals instead of the usual 8 kHz-sampled ones. Recognition tests have been carried out within the Aurora experimental framework, which includes signals from both a close-talking microphone and a distant microphone. Our preliminary results indicate that it is possible to get a performance improvement from the increased bandwidth in the noisy car environment.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116862653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative steering of microphone array and video camera toward multi-lingual tele-conference through speech-to-speech translation
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034602
T. Nishiura, R. Gruhn, S. Nakamura
Capturing distant-talking speech with high quality is very important for multilingual teleconferencing through speech-to-speech translation. In addition, the speaker's image is needed to realize natural communication in such a conference. A microphone array is an ideal candidate for capturing distant-talking speech: uttered speech can be enhanced and speaker images can be captured by steering a microphone array and a video camera toward the speaker. However, automatic steering requires localizing the talker. To address this problem, we propose real-time collaborative steering of the microphone array and the video camera for a multilingual teleconference through speech-to-speech translation. We conducted experiments in a real room environment. With the speaker located 2.0 meters from the microphone array, the speaker localization rate (i.e., speaker image capturing rate) was 97.7%, the speech recognition rate was 90.0%, and the TOEIC score was 530-540 points.
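For illustration only (the paper does not specify its localization method): steering toward a talker is often based on time-delay estimation between microphone pairs; a minimal GCC-PHAT sketch follows, with all parameter values illustrative.

```python
# Sketch: GCC-PHAT time-delay estimation between two microphone channels,
# a common building block for steering a microphone array and camera
# toward the talker. Purely illustrative; not the paper's method.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                # PHAT weighting keeps phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)                      # positive: sig lags ref

fs = 16000
ref = np.random.randn(fs)
sig = np.roll(ref, 16)                            # delay ref by 16 samples (1 ms)
print(gcc_phat(sig, ref, fs))                     # approximately 0.001
```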
{"title":"Collaborative steering of microphone array and video camera toward multi-lingual tele-conference through speech-to-speech translation","authors":"T. Nishiura, R. Gruhn, S. Nakamura","doi":"10.1109/ASRU.2001.1034602","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034602","url":null,"abstract":"It is very important for multilingual teleconferencing through speech-to-speech translation to capture distant-talking speech with high quality. In addition, the speaker image is also needed to realize a natural communication in such a conference. A microphone array is an ideal candidate for capturing distant-talking speech. Uttered speech can be enhanced and speaker images can be captured by steering a microphone array and a video camera in the speaker direction. However, to realize automatic steering, it is necessary to localize the talker. To overcome this problem, we propose collaborative steering of the microphone array and the video camera in real-time for a multilingual teleconference through speech-to-speech translation. We conducted experiments in a real room environment. The speaker localization rate (i.e., speaker image capturing rate) was 97.7%, speech recognition rate was 90.0%, and TOEIC score was 530/spl sim/540 points, subject to locating the speaker at a 2.0 meter distance from the microphone array.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115515097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transducer composition for "on-the-fly" lexicon and language model integration
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034667
D. Caseiro, I. Trancoso
We present the use of a specialized composition algorithm that allows the generation of a determinized search network for ASR in a single step. The algorithm is exact in the sense that the result is determinized when the lexicon and the language model are represented as determinized transducers. The composition and determinization are performed simultaneously, which is of great importance for "on-the-fly" operation. The algorithm pushes the language model weights towards the initial state of the network. Our results show that it is advantageous to use the maximum amount of information as early as possible in the decoding procedure.
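For illustration only (not the paper's algorithm): the toy below shows plain lazy, i.e. on-demand, composition of two epsilon-free weighted transducers in the tropical semiring; the simultaneous determinization and weight pushing described in the abstract are not reproduced, and all data structures are illustrative.

```python
# Sketch: on-demand ("on-the-fly") composition of two epsilon-free weighted
# transducers. A composed state is a pair (state of A, state of B), and its
# outgoing arcs are computed only when a decoder asks for them, so the full
# search network is never expanded statically.
from functools import lru_cache

# arcs[state] = list of (input_label, output_label, weight, next_state)
A = {0: [("a", "X", 0.5, 1)], 1: [("b", "Y", 0.3, 2)], 2: []}   # lexicon-like
B = {0: [("X", "X", 1.0, 1)], 1: [("Y", "Y", 0.2, 2)], 2: []}   # LM-like

@lru_cache(maxsize=None)
def expand(sA, sB):
    """Outgoing arcs of composed state (sA, sB), computed on demand."""
    out = []
    for ilab, olab, wA, nA in A[sA]:
        for ilab2, olab2, wB, nB in B[sB]:
            if olab == ilab2:                    # A's output must match B's input
                out.append((ilab, olab2, wA + wB, (nA, nB)))
    return out

# A decoder would call expand() only for the states it actually reaches:
state = (0, 0)
while True:
    arcs = expand(*state)
    if not arcs:
        break
    ilab, olab, w, state = arcs[0]               # follow the first arc for illustration
    print(ilab, olab, w, state)
```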
{"title":"Transducer composition for \"on-the-fly\" lexicon and language model integration","authors":"D. Caseiro, I. Trancoso","doi":"10.1109/ASRU.2001.1034667","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034667","url":null,"abstract":"We present the use of a specialized composition algorithm that allows the generation of a determinized search network for ASR in a single step. The algorithm is exact in the sense that the result is determinized when the lexicon and the language model are represented as determinized transducers. The composition and determinization are performed simultaneously, which is of great importance for \"on-the-fly\" operation. The algorithm pushes the language model weights towards the initial state of the network. Our results show that it is advantageous to use the maximum amount of information as early as possible in the decoding procedure.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123644859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching for the missing piece [speech recognition]
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034629
W. N. Choi, Y. W. Wong, T. Lee, P. Ching
The tree-trellis forward-backward algorithm has been widely used for N-best searching in continuous speech recognition. In conventional approaches, the heuristic score used for the A* backward search is derived from the partial-path scores recorded during the forward pass. The inherently delayed use of a language model in the lexical tree structure leads to inefficient pruning and the partial-path score recorded is an underestimated heuristic score. This paper presents a novel method of computing the heuristic score that is more accurate than the partial-path score. The goal is to recover high-score sentence hypotheses that may have been pruned halfway during the forward search due to the delayed use of the LM. For the application of Hong Kong stock information inquiries, the proposed technique shows a noticeable performance improvement. In particular, a relative error-rate reduction of 12% has been achieved for top-1 sentences.
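For illustration only (the lattice and scores below are invented): a minimal A* N-best sketch in which the exact best forward score to each node serves as the heuristic for the backward search, in the spirit of the tree-trellis algorithm; the paper's contribution, a more accurate heuristic under delayed LM application in a tree lexicon, is not reproduced.

```python
# Sketch: A*-based N-best extraction over a small word lattice. The forward
# pass records the best start-to-node score; the backward A* search uses it
# as the heuristic, so hypotheses are popped in order of total score.
import heapq

# edges[u] = list of (word, log_score, v); node 0 is the start, node 3 the goal
edges = {
    0: [("buy", -1.0, 1), ("sell", -1.2, 1)],
    1: [("ten", -0.5, 2), ("two", -0.9, 2)],
    2: [("shares", -0.3, 3)],
    3: [],
}
start, goal = 0, 3

# Forward pass: best score from the start node to every node
# (nodes happen to be in topological order here).
forward = {start: 0.0}
for u in sorted(edges):
    for word, s, v in edges[u]:
        if u in forward:
            forward[v] = max(forward.get(v, float("-inf")), forward[u] + s)

# Reverse adjacency for the backward search.
rev = {}
for u, arcs in edges.items():
    for word, s, v in arcs:
        rev.setdefault(v, []).append((word, s, u))

def nbest(n):
    # heap entries: (-(backward score + forward heuristic), node, words so far)
    heap = [(-forward[goal], goal, [])]
    results = []
    while heap and len(results) < n:
        neg_total, u, words = heapq.heappop(heap)
        if u == start:
            results.append((-neg_total, list(reversed(words))))
            continue
        g = -neg_total - forward[u]              # backward score accumulated so far
        for word, s, prev in rev.get(u, []):
            heapq.heappush(heap, (-(g + s + forward[prev]), prev, words + [word]))
    return results

print(nbest(3))   # best: (-1.8, ['buy', 'ten', 'shares'])
```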
{"title":"Searching for the missing piece [speech recognition]","authors":"W. N. Choi, Y. W. Wong, T. Lee, P. Ching","doi":"10.1109/ASRU.2001.1034629","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034629","url":null,"abstract":"The tree-trellis forward-backward algorithm has been widely used for N-best searching in continuous speech recognition. In conventional approaches, the heuristic score used for the A* backward search is derived from the partial-path scores recorded during the forward pass. The inherently delayed use of a language model in the lexical tree structure leads to inefficient pruning and the partial-path score recorded is an underestimated heuristic score. This paper presents a novel method of computing the heuristic score that is more accurate than the partial-path score. The goal is to recover high-score sentence hypotheses that may have been pruned halfway during the forward search due to the delayed use of the LM. For the application of Hong Kong stock information inquiries, the proposed technique shows a noticeable performance improvement. In particular, a relative error-rate reduction of 12% has been achieved for top-1 sentences.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129093692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous recognition of distant talking speech of multiple sound sources based on 3-D N-best search algorithm
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034600
P. Heracleous, S. Nakamura, K. Shikano
This paper deals with the simultaneous recognition of distant-talking speech from multiple talkers using the 3D N-best search algorithm. We describe the basic idea of the 3D N-best search and address two additional techniques implemented in the baseline system: path distance-based clustering and likelihood normalization, both of which proved necessary to build an efficient system for our purpose. In previous work we reported results of experiments carried out on simulated data. In this paper we report results obtained with reverberated data, both simulated by the image method and recorded in a real room. The image method was used to examine the relationship between accuracy and reverberation time, and the real recordings were used to evaluate the practical performance of our algorithm. With the image method, the Top-3 simultaneous word accuracy was 73.02% at a reverberation time of 162 ms.
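For illustration only (not the image method used in the paper): the sketch below reverberates a clean signal by convolving it with a synthetic exponentially decaying impulse response matched to a 162 ms RT60, just to show how reverberated evaluation data can be produced; all values are illustrative.

```python
# Sketch: add reverberation to a clean utterance with a synthetic impulse
# response whose 60 dB energy-decay time is 162 ms. A crude stand-in for the
# image method; random noise stands in for the clean utterance.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
rt60 = 0.162                                     # reverberation time in seconds
t = np.arange(int(rt60 * fs)) / fs
decay = np.exp(-3.0 * np.log(10) * t / rt60)     # amplitude 1e-3 (-60 dB energy) at rt60
rir = np.random.randn(t.size) * decay
rir /= np.max(np.abs(rir))

clean = np.random.randn(fs)                      # stand-in for a clean utterance
reverberated = fftconvolve(clean, rir)[: clean.size]
```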
{"title":"Simultaneous recognition of distant talking speech of multiple sound sources based on 3-D N-best search algorithm","authors":"P. Heracleous, S. Nakamura, K. Shikano","doi":"10.1109/ASRU.2001.1034600","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034600","url":null,"abstract":"This paper deals with the simultaneous recognition of distant-talking speech of multiple talkers using the 3D N-best search algorithm. We describe the basic idea of the 3D N-best search and we address two additional techniques implemented into the baseline system. Namely, a path distance-based clustering and a likelihood normalization technique appeared to be necessary in order to build an efficient system for our purpose. In previous works we introduced the results of experiments carried out on simulated data. In this paper we introduce the results of the experiments carried out using reverberated data. The reverberated data are those simulated by the image method and recorded in a real room. The image method was used to find out the accuracy-reverberation time relationship, and the real data was used to evaluate the real performance of our algorithm. The obtained Top 3 results of the simultaneous word accuracy was 73.02% under 162 ms reverberation time and using the image method.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128772959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic analysis and recognition of whispered speech
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034676
Taisuke Itoh, K. Takeda, F. Itakura
The acoustic properties of whispered speech and a method for recognizing it are discussed. A whispered-speech database was prepared, consisting of whispered speech, normal speech, and the corresponding facial video images for more than 6,000 sentences from 100 speakers. The comparison between whispered and normal utterances shows that: 1) the cepstrum distance between them is 4 dB for voiced phonemes and 2 dB for unvoiced phonemes; 2) the spectral tilt of whispered speech is less sloped than that of normal speech; 3) the frequencies of the lower formants (below 1.5 kHz) are lower than in normal speech. Acoustic models (HMMs) trained on the whispered-speech database attain an accuracy of 60% in syllable recognition experiments. This accuracy improves to 63% when MLLR (maximum likelihood linear regression) adaptation is applied, whereas normal-speech HMMs adapted with whispered speech attain only 56% syllable accuracy.
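For illustration only (frame length, window, and input are invented): one of the reported properties, spectral tilt, can be estimated as the slope of a line fitted to the log-magnitude spectrum of a frame, as sketched below.

```python
# Sketch: spectral tilt of a speech frame, estimated as the slope of a
# straight-line fit to its log-magnitude spectrum. Whispered frames are
# expected to show a flatter (less steeply sloped) spectrum than normal ones.
import numpy as np

def spectral_tilt(frame, fs):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame.size)))
    freqs = np.fft.rfftfreq(frame.size, d=1.0 / fs)
    log_mag = 20.0 * np.log10(spectrum + 1e-10)
    slope, _ = np.polyfit(freqs[1:], log_mag[1:], deg=1)   # skip the DC bin
    return slope * 1000.0                                   # dB per kHz

fs = 16000
frame = np.random.randn(400)                  # stand-in for a 25 ms frame at 16 kHz
print(spectral_tilt(frame, fs))
```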
{"title":"Acoustic analysis and recognition of whispered speech","authors":"Taisuke Itoh, K. Takeda, F. Itakura","doi":"10.1109/ASRU.2001.1034676","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034676","url":null,"abstract":"The acoustic properties and a recognition method of whispered speech are discussed. A whispered speech database that consists of whispered speech, normal speech and the corresponding facial video images of more than 6,000 sentences from 100 speakers was prepared. The comparison between whispered and normal utterances show that: 1) the cepstrum distance between them is 4 dB for voiced and 2 dB for unvoiced phonemes; 2) the spectral tilt of whispered speech is less sloped than for normal speech; 3) the frequency of the lower formants (below 1.5 kHz) is lower than that of normal speech. Acoustic models (HMM) trained by the whispered speech database attain an accuracy of 60% in syllable recognition experiments. This accuracy can be improved to 63% when MLLR (maximum likelihood linear regression) adaptation is applied, while the normal speech HMMs adapted with whispered speech attain only 56% syllable accuracy.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116218891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust speaker clustering in eigenspace
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034588
R. Faltlhauser, G. Ruske
We propose a speaker clustering scheme that works in 'eigenspace'. Speaker models are transformed to a low-dimensional subspace using 'eigenvoices'. For the speaker clustering procedure, simple distance measures, e.g. the Euclidean distance, can be applied. Moreover, clustering can be carried out with base models (for eigenvoice projection) such as Gaussian mixture models as well as conventional HMMs. In the case of HMMs, re-projection to the original space readily yields acoustic models. Clustering in the subspace produces well-balanced clusters and is easy to control. In the field of speaker adaptation, several principal techniques can be distinguished. The most prominent among them are Bayesian adaptation (e.g. MAP), transformation-based approaches (MLLR, maximum likelihood linear regression), and so-called eigenspace techniques. The latter in particular have become increasingly popular, as they make use of a priori information about the distribution of speaker models; the basic approach is commonly called the eigenvoice (EV) approach. Besides these techniques, speaker clustering is a further attractive adaptation scheme, especially since it can be, and has been, easily combined with the above methods.
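For illustration only (the supervector construction, dimensions, and scikit-learn usage are assumptions, not the paper's configuration): a minimal sketch of clustering speakers in an eigenvoice subspace.

```python
# Sketch: speaker clustering in "eigenspace". Each speaker is summarized by a
# supervector (e.g. concatenated GMM/HMM mean vectors), the supervectors are
# projected onto a few eigenvoices via PCA, and Euclidean k-means clusters the
# speakers in that low-dimensional subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_speakers, supervector_dim = 200, 1024
supervectors = rng.standard_normal((n_speakers, supervector_dim))  # stand-in data

eigenspace = PCA(n_components=10).fit(supervectors)   # eigenvoices
coords = eigenspace.transform(supervectors)            # speakers in the subspace

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(coords)
print(np.bincount(kmeans.labels_))                      # cluster sizes

# Re-projecting a cluster centroid back to the original space yields a
# cluster-dependent model, in line with the re-projection step in the abstract.
cluster_models = eigenspace.inverse_transform(kmeans.cluster_centers_)
```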
{"title":"Robust speaker clustering in eigenspace","authors":"R. Faltlhauser, G. Ruske","doi":"10.1109/ASRU.2001.1034588","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034588","url":null,"abstract":"We propose a speaker clustering scheme working in 'eigenspace'. Speaker models are transformed to a low-dimensional subspace using 'eigenvoices'. For the speaker clustering procedure, simple distance measures, e.g. Euclidean distance, can be applied. Moreover, clustering can be accomplished with base models (for eigenvoice projection) like Gaussian mixture models as well as conventional HMMs. In case of HMMs, re-projection to the original space readily yields acoustic models. Clustering in subspace produces a well-balanced cluster and is easy to control. In the field of speaker adaptation, several principal techniques can be distinguished. The most prominent among them are Bayesian adaptation (e.g. MAP), transformation based approaches (MLLR - maximum likelihood linear regression), as well as so-called eigenspace techniques. Especially the latter have become increasingly popular, as they make use of a-priori information about the distribution of speaker models. The basic approach is commonly called the eigenvoice (EV) approach. Besides these techniques, speaker clustering is a further attractive adaptation scheme, especially since it can be - and has been - easily combined with the above methods.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131693597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating dialogue strategies and user behavior
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034630
M. Danieli
Summary form only given. The need for accurate and flexible evaluation frameworks for spoken and multimodal dialogue systems has become crucial. In the early design phases of spoken dialogue systems, it is worthwhile to evaluate how easily users interact with different dialogue strategies, rather than how efficiently the dialogue system provides the required information. The success of a task-oriented dialogue system greatly depends on its ability to provide a meaningful match between the user's expectations and the system's capabilities, and a good trade-off improves the user's effectiveness. The evaluation methodology requires three steps. The first step aims to identify the different tokens and relations that constitute the user's mental model of the task. Once tokens and relations have been taken into account in designing one or more dialogue strategies, the evaluation enters its second step, a between-group experiment in which each strategy is tried by a representative set of experimental subjects. The third step consists of measuring how effectively users provide the spoken dialogue system with the information it needs to solve the task. The paper argues that applying this three-step evaluation method may increase our understanding of the user's mental model of a task during the early stages of development of a spoken language agent. Experimental data supporting this claim are reported.
{"title":"Evaluating dialogue strategies and user behavior","authors":"M. Danieli","doi":"10.1109/ASRU.2001.1034630","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034630","url":null,"abstract":"Summary form only given. The need for accurate and flexible evaluation frameworks for spoken and multimodal dialogue systems has become crucial. In the early design phases of spoken dialogue systems, it is worthwhile evaluating the user's easiness in interacting with different dialogue strategies, rather than the efficiency of the dialogue system in providing the required information. The success of a task-oriented dialogue system greatly depends on the ability of providing a meaningful match between user's expectations and system capabilities, and a good trade-off improves the user's effectiveness. The evaluation methodology requires three steps. The first step has the goal of individuating the different tokens and relations that constitute the user mental model of the task. Once tokens and relations are considered for designing one or more dialogue strategies, the evaluation enters its second step which is constituted by a between-group experiment. Each strategy is tried by a representative set of experimental subjects. The third step includes measuring user effectiveness in providing the spoken dialogue system with the information it needs to solve the task. The paper argues that the application of the three-steps evaluation method may increase our understanding of the user mental model of a task during early stages of development of a spoken language agent. Experimental data supporting this claim are reported.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"474 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131835448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incremental language models for speech recognition using finite-state transducers
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034620
Hans J. G. A. Dolfing, I. L. Hetherington
In the context of the weighted finite-state transducer approach to speech recognition, we investigate a novel decoding strategy to deal with very large n-gram language models often used in large-vocabulary systems. In particular, we present an alternative to full, static expansion and optimization of the finite-state transducer network. This alternative is useful when the individual knowledge sources, modeled as transducers, are too large to be composed and optimized. While the recognition decoder perceives a single, weighted finite-state transducer, we apply a divide-and-conquer technique to split the language model into two parts which add up exactly to the original language model. We investigate the merits of these 'incremental language models' and present some initial results.
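For illustration only (toy probabilities; the paper realizes this with weighted finite-state transducers): the "add up exactly" property can be seen in the log domain, where a small first-pass model plus a residual correction reproduces the full language model score exactly.

```python
# Sketch: split a language model score into a small part applied early plus a
# residual applied later, such that the two parts sum exactly to the full
# model's log-probability.
import math

full_bigram = {("buy", "stock"): 0.10, ("sell", "stock"): 0.02}   # P(word | history)
unigram = {"stock": 0.05}                                          # small first-pass LM

def split_score(history, word):
    first = math.log(unigram[word])                                # applied early
    residual = math.log(full_bigram[(history, word)]) - first      # correction applied later
    return first, residual

first, residual = split_score("buy", "stock")
assert abs((first + residual) - math.log(full_bigram[("buy", "stock")])) < 1e-12
```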
{"title":"Incremental language models for speech recognition using finite-state transducers","authors":"Hans J. G. A. Dolfing, I. L. Hetherington","doi":"10.1109/ASRU.2001.1034620","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034620","url":null,"abstract":"In the context of the weighted finite-state transducer approach to speech recognition, we investigate a novel decoding strategy to deal with very large n-gram language models often used in large-vocabulary systems. In particular, we present an alternative to full, static expansion and optimization of the finite-state transducer network. This alternative is useful when the individual knowledge sources, modeled as transducers, are too large to be composed and optimized. While the recognition decoder perceives a single, weighted finite-state transducer, we apply a divide-and-conquer technique to split the language model into two parts which add up exactly to the original language model. We investigate the merits of these 'incremental language models' and present some initial results.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":" 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132123945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech interfaces for mobile communications
Pub Date: 2001-12-09 | DOI: 10.1109/ASRU.2001.1034596
H. Nakano
This paper discusses speech interfaces for mobile communication. Mobile interfaces have three important design rules: do not disturb the user's main task, work within the restrictions of the user's abilities, and minimize resource requirements. Social acceptance is also important. In Japan, trial and regular services with speech interfaces in mobile environments have already been launched, but they are not widely used; these mobile interfaces still need improvement. The speech interface will not replace Web browsers, but should support and interwork with other interfaces. We also need to discover content that suits speech interfaces.
{"title":"Speech interfaces for mobile communications","authors":"H. Nakano","doi":"10.1109/ASRU.2001.1034596","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034596","url":null,"abstract":"This paper explains speech interfaces for mobile communication. Mobile interfaces have three important design rules: do not disturb the user's main task, work within the restrictions of user's ability, and minimize the resource requirements. Social acceptance is also important. In Japan, trial and regular services with speech interfaces in mobile environments have already been launched, but they are not widely used. They must be improved in mobile interfaces. The speech interface will not replace Web browsers, but should support and interwork with other interfaces. We also have to discover contents that suit speech interfaces.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131276395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}