Robust speech recognition by properly utilizing reliable frames and segments in corrupted signals
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430091
Yi Chen, C. Wan, Lin-Shan Lee
In this paper, we propose a new approach to detecting and utilizing reliable frames and segments in corrupted signals for robust speech recognition. Novel approaches to estimating an energy-based measure and a harmonicity measure for each frame are developed. SNR-dependent GMM classifiers are then trained, together with a reliable frame selection and clustering module and a reliable segment identification module, to detect the most reliable frames in an utterance. The reliable frames and segments thus obtained can be used in both front-end feature enhancement and back-end Viterbi decoding. In the extensive experiments reported here, significant improvements in recognition accuracy were obtained with the proposed approaches for all noise types and all SNR values defined in the Aurora 2 database.
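The abstract specifies the pipeline but not the exact form of the measures or classifiers. Below is a minimal, hypothetical sketch of how a per-frame log-energy measure and an autocorrelation-based harmonicity measure might be computed and then classified with a two-component GMM per SNR condition; the sampling rate, frame sizes, feature definitions, and the use of scikit-learn's GaussianMixture are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: per-frame energy and harmonicity measures plus a GMM
# reliability classifier. Feature definitions and model choices are assumed.
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_measures(signal, sr=8000, frame_len=200, hop=80):
    """Return one (log-energy, harmonicity) pair per frame.

    Harmonicity is approximated by the peak of the normalized autocorrelation
    within a plausible pitch-lag range (50-400 Hz)."""
    lo, hi = sr // 400, sr // 50                      # pitch-lag search range
    feats = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        harmonicity = np.max(ac[lo:hi]) / (ac[0] + 1e-10)
        feats.append([log_energy, harmonicity])
    return np.array(feats)

def train_snr_gmms(feats_by_snr):
    """Fit one two-component GMM per SNR bucket (buckets assumed to be known
    from the noisy training material)."""
    return {snr: GaussianMixture(n_components=2, covariance_type="full",
                                 random_state=0).fit(f)
            for snr, f in feats_by_snr.items()}

def reliable_frames(feats, gmm):
    """Call a frame reliable if it falls in the component with higher mean energy."""
    reliable_comp = int(np.argmax(gmm.means_[:, 0]))
    return gmm.predict(feats) == reliable_comp
```

A reliable segment could then be formed, for instance, by keeping runs of consecutive reliable frames above a minimum length before passing them to feature enhancement or to the decoder.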
{"title":"Robust speech recognition by properly utilizing reliable frames and segments in corrupted signals","authors":"Yi Chen, C. Wan, Lin-Shan Lee","doi":"10.1109/ASRU.2007.4430091","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430091","url":null,"abstract":"In this paper, we propose a new approach to detecting and utilizing reliable frames and segments in corrupted signals for robust speech recognition. Novel approaches to estimating an energy-based measure and a harmonicity measure for each frame are developed. SNR-dependent GMM classifiers are then trained, together with a reliable frame selection and clustering module and a reliable segment identification module, to detect the most reliable frames in an utterance. These reliable frames and segments thus obtained can be properly used in both front-end feature enhancement and back-end Viterbi decoding. In the extensive experiments reported here, very significant improvements in recognition accuracies were obtained with the proposed approaches for all types of noise and all SNR values defined in the Aurora 2 database.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128143703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Building a highly accurate Mandarin speech recognizer
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430161
M. Hwang, Gang Peng, Wen Wang, Arlo Faria, A. Heidel, Mari Ostendorf
We describe a highly accurate large-vocabulary continuous Mandarin speech recognizer, a collaborative effort among four research organizations. In particular, we build two acoustic models (AMs) with significant differences but similar accuracy for the purposes of cross-adaptation and system combination. This paper elaborates on the main differences between the two systems: one recognizer incorporates a discriminatively trained feature while the other uses a discriminative feature transformation. Additionally, we present an improved acoustic segmentation algorithm and topic-based language model (LM) adaptation. Together with increased acoustic training data, these improvements reduced the character error rate (CER) on the DARPA GALE 2006 evaluation set from 18.4% to 15.3%.
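For context on the reported numbers, character error rate is the character-level Levenshtein distance between hypothesis and reference divided by the reference length, so the drop from 18.4% to 15.3% corresponds to roughly a 17% relative reduction. A minimal CER computation might look like the sketch below; it is illustrative only and omits the text normalization and segmentation used in the official GALE scoring.

```python
# Minimal character error rate (CER) via Levenshtein distance; illustrative
# only, not the official GALE scoring pipeline.
def edit_distance(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # delete r
                           cur[j - 1] + 1,            # insert h
                           prev[j - 1] + (r != h)))   # substitute (or match)
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

baseline, improved = 0.184, 0.153
print(f"relative CER reduction: {(baseline - improved) / baseline:.1%}")  # ~16.8%
```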
{"title":"Building a highly accurate Mandarin speech recognizer","authors":"M. Hwang, Gang Peng, Wen Wang, Arlo Faria, A. Heidel, Mari Ostendorf","doi":"10.1109/ASRU.2007.4430161","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430161","url":null,"abstract":"We describe a highly accurate large-vocabulary continuous Mandarin speech recognizer, a collaborative effort among four research organizations. Particularly, we build two acoustic models (AMs) with significant differences but similar accuracy for the purposes of cross adaptation and system combination. This paper elaborates on the main differences between the two systems, where one recognizer incorporates a discriminatively trained feature while the other utilizes a discriminative feature transformation. Additionally we present an improved acoustic segmentation algorithm and topic-based language model (LM) adaptation. Coupled with increased acoustic training data, we reduced the character error rate (CER) of the DARPA GALE 2006 evaluation set to 15.3% from 18.4%.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115751332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minimum mutual information beamforming for simultaneous active speakers
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430086
K. Kumatani, U. Mayer, Tobias Gehrig, Emilian Stoimenov, J. McDonough, Matthias Wölfel
In this work, we address an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly adjust the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). In order to calculate the mutual information of the complex subband snapshots, we consider four probability density functions (pdfs), namely the Gaussian, Laplace, K0 and Γ pdfs. The latter three belong to the class of super-Gaussian density functions that are typically used in independent component analysis as opposed to conventional beamforming. We demonstrate the effectiveness of the proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge. In these experiments, the delay-and-sum beamformer achieved a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieved 55.2% WER, which was further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone was 21.6%.
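Under the Gaussian assumption, the mutual information between two zero-mean, circularly-symmetric complex outputs depends only on their complex correlation coefficient, I(Y1; Y2) = -log(1 - |ρ|²). The sketch below estimates just that objective from subband snapshots of two beamformer outputs; it is a hypothetical illustration of the criterion, not the paper's GSC weight optimization nor its super-Gaussian (Laplace, K0, Γ) variants.

```python
# Empirical MMI objective for one subband under a circularly-symmetric complex
# Gaussian assumption: I = -log(1 - |rho|^2). Only the objective is shown; the
# joint optimization of the GSC active weight vectors is omitted.
import numpy as np

def gaussian_mutual_information(y1, y2, eps=1e-12):
    """y1, y2: complex snapshots of one subband across frames."""
    rho = np.mean(y1 * np.conj(y2)) / np.sqrt(
        np.mean(np.abs(y1) ** 2) * np.mean(np.abs(y2) ** 2) + eps)
    return -np.log(1.0 - np.abs(rho) ** 2 + eps)

# Toy check: two outputs sharing a common component have non-zero MI.
rng = np.random.default_rng(0)
common = rng.standard_normal(2000) + 1j * rng.standard_normal(2000)
noise = rng.standard_normal(2000) + 1j * rng.standard_normal(2000)
print(gaussian_mutual_information(common + 0.3 * noise, 0.3 * common + noise))
```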
{"title":"Minimum mutual information beamforming for simultaneous active speakers","authors":"K. Kumatani, U. Mayer, Tobias Gehrig, Emilian Stoimenov, J. McDonough, Matthias Wölfel","doi":"10.1109/ASRU.2007.4430086","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430086","url":null,"abstract":"In this work, we address an acoustic beamforming application where two speakers are simultaneously active. We construct one subband domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly adjust the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). In order to calculate the mutual information of the complex subband snapshots, we consider four probability density functions (pdfs), namely the Gaussian, Laplace, K0 and lceil pdfs. The latter three belong to the class of super-Gaussian density functions that are typically used in independent component analysis as opposed to conventional beam-forming. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge. In the experiments, the delay-and-sum beamformer achieved a word error rate (WER) of 70.4 %. The MMI beamformer under a Gaussian assumption achieved 55.2 % WER which was further reduced to 52.0 % with a K0 pdf, whereas the WER for data recorded with close-talking microphone was 21.6 %.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129399289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Development of the 2007 RWTH Mandarin LVCSR system
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430155
Björn Hoffmeister, Christian Plahl, P. Fritz, G. Heigold, J. Lööf, R. Schlüter, H. Ney
This paper describes the development of the RWTH Mandarin LVCSR system. Different acoustic front-ends, together with multiple-system cross-adaptation, are used in a two-stage decoding framework. We describe the system in detail and present systematic recognition results. In particular, we compare a variety of approaches for cross-adapting to multiple systems. During development we also carried out a comparative study of different methods for integrating tone and phoneme posterior features. Furthermore, we apply lattice-based consensus decoding and system combination methods, and compare the effect of minimizing character errors instead of word errors. The final system obtains a character error rate of 17.7% on the GALE 2006 evaluation data.
{"title":"Development of the 2007 RWTH Mandarin LVCSR system","authors":"Björn Hoffmeister, Christian Plahl, P. Fritz, G. Heigold, J. Lööf, R. Schlüter, H. Ney","doi":"10.1109/ASRU.2007.4430155","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430155","url":null,"abstract":"This paper describes the development of the RWTH Mandarin LVCSR system. Different acoustic front-ends together with multiple system cross-adaptation are used in a two stage decoding framework. We describe the system in detail and present systematic recognition results. Especially, we compare a variety of approaches for cross-adapting to multiple systems. During the development we did a comparative study on different methods for integrating tone and phoneme posterior features. Furthermore, we apply lattice based consensus decoding and system combination methods. In these methods, the effect of minimizing character instead of word errors is compared. The final system obtains a character error rate of 17.7% on the GALE 2006 evaluation data.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132294726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adapting grapheme-to-phoneme conversion for name recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430097
Xiao Li, A. Gunawardana, A. Acero
This work investigates the use of acoustic data to improve grapheme-to-phoneme conversion for name recognition. We introduce a joint model of acoustics and graphonemes, and present two approaches, maximum likelihood training and discriminative training, for adapting graphoneme model parameters. Experiments on a large-scale voice-dialing system show that the maximum likelihood approach yields a 7% relative reduction in SER compared to the best baseline result obtained without leveraging acoustic data, while discriminative training enlarges the SER reduction to 12%.
{"title":"Adapting grapheme-to-phoneme conversion for name recognition","authors":"Xiao Li, A. Gunawardana, A. Acero","doi":"10.1109/ASRU.2007.4430097","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430097","url":null,"abstract":"This work investigates the use of acoustic data to improve grapheme-to-phoneme conversion for name recognition. We introduce a joint model of acoustics and graphonemes, and present two approaches, maximum likelihood training and discriminative training, in adapting graphoneme model parameters. Experiments on a large-scale voice-dialing system show that the maximum likelihood approach yields a relative 7% reduction in SER compared to the best baseline result we obtained without leveraging acoustic data, while discriminative training enlarges the SER reduction to 12%.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115679615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interpolative variable frame rate transmission of speech features for distributed speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430179
Huiqun Deng, D. O'Shaughnessy, Jean-Guy Dahan, W. Ganong
In distributed speech recognition, vector quantization is used to reduce the number of bits for coding speech features at the user end, in order to save energy when transmitting speech feature streams to remote recognizers and to reduce data traffic congestion. We observe that the overall bit rate of the transmitted feature streams can be further reduced by not sending redundant frames that the remote server can interpolate from the received frames. Interpolation introduces errors and may degrade speech recognition. This paper investigates methods of selecting frames for transmission and the effect of interpolation on recognition. Experiments on a large-vocabulary recognizer show that, with spline interpolation, the overall frame rate for transmission can be reduced by about 50% with a relative increase in word error rate of less than 5.2% for both clean and noisy speech.
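As a concrete (hypothetical) rendering of the idea, the client could transmit every other frame and the server could rebuild the skipped frames with a cubic spline over time, per feature dimension; the fixed 50% rate and the use of SciPy's CubicSpline are assumptions, not the paper's exact frame-selection scheme.

```python
# Sketch: drop every other feature frame and reconstruct it server-side with a
# cubic spline along the time axis. Frame rate and feature dimension assumed.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))            # stand-in for 13-dim MFCC frames

kept = np.arange(0, len(feats), 2)                # client sends ~50% of the frames
dropped = np.setdiff1d(np.arange(len(feats)), kept)

spline = CubicSpline(kept, feats[kept], axis=0)   # server-side interpolation
reconstructed = feats.copy()
reconstructed[dropped] = spline(dropped)

rms_err = np.sqrt(np.mean((reconstructed[dropped] - feats[dropped]) ** 2))
print(f"RMS interpolation error on dropped frames: {rms_err:.3f}")
```

With real cepstral trajectories, which vary far more smoothly than this random stand-in, the interpolation error should be correspondingly smaller.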
{"title":"Interpolative variable frame rate transmission of speech features for distributed speech recognition","authors":"Huiqun Deng, D. O'Shaughnessy, Jean-Guy Dahan, W. Ganong","doi":"10.1109/ASRU.2007.4430179","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430179","url":null,"abstract":"In distributed speech recognition, vector quantization is used to reduce the number of bits for coding speech features at the user end in order to save energy for transmitting speech feature streams to remote recognizers and reduce data traffic congestion. We notice that the overall bit rate of the transmitted feature streams could be further reduced by not sending redundant frames that can be interpolated at the remote server from received frames. Interpolation introduces errors and may degrade speech recognition. This paper investigates the methods of selecting frames for transmission and the effect of interpolation on recognition. Experiments on a large vocabulary recognizer show that with spline interpolation, the overall frame rate for transmission can be reduced by about 50% with a relative increase in word error rate less than 5.2% for clean and noisy speech.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124205742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognition and understanding of meetings the AMI and AMIDA projects
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430116
S. Renals, Thomas Hain, H. Bourlard
The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty meetings. Within these projects we have: developed an infrastructure for recording meetings using multiple microphones and cameras; released a 100-hour annotated corpus of meetings; developed techniques for the recognition and interpretation of meetings based primarily on speech recognition and computer vision; and developed an evaluation framework at both the component and system levels. In this paper we present an overview of these projects, with an emphasis on speech recognition and content extraction.
{"title":"Recognition and understanding of meetings the AMI and AMIDA projects","authors":"S. Renals, Thomas Hain, H. Bourlard","doi":"10.1109/ASRU.2007.4430116","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430116","url":null,"abstract":"The AMI and AMIDA projects are concerned with the recognition and interpretation of multiparty meetings. Within these projects we have: developed an infrastructure for recording meetings using multiple microphones and cameras; released a 100 hour annotated corpus of meetings; developed techniques for the recognition and interpretation of meetings based primarily on speech recognition and computer vision; and developed an evaluation framework at both component and system levels. In this paper we present an overview of these projects, with an emphasis on speech recognition and content extraction.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115067448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Voice/audio information retrieval: minimizing the need for human ears
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430183
M. Clements, M. Gavaldà
This paper discusses the challenges of building information retrieval applications that operate on large amounts of voice/audio data. Various problems and issues are presented along with proposed solutions. A set of techniques based on a phonetic keyword spotting approach is presented, together with examples of concrete applications that solve real-life problems.
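At its simplest, phonetic keyword spotting can be pictured as fuzzy matching of a query's phone string against a phone-level transcript or index; the sketch below uses a sliding edit-distance window as a hypothetical stand-in for the lattice-based search with timing and confidence scores that a production system would use.

```python
# Toy phonetic keyword spotting: slide the query phone string over a decoded
# phone sequence and report windows whose edit distance is below a threshold.
# A real system would search phone lattices with timing and confidence scores.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def spot(query_phones, decoded_phones, max_dist=1):
    hits = []
    q = len(query_phones)
    for start in range(len(decoded_phones) - q + 1):
        window = decoded_phones[start:start + q]
        d = edit_distance(query_phones, window)
        if d <= max_dist:
            hits.append((start, d))
    return hits

# Hypothetical phone strings (not tied to any particular phone set).
decoded = "sil hh ax l ow w er l d sil".split()
query = "hh eh l ow".split()          # "hello" with one vowel mismatch
print(spot(query, decoded))           # -> [(1, 1)]
```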
{"title":"Voice/audio information retrieval: minimizing the need for human ears","authors":"M. Clements, M. Gavaldà","doi":"10.1109/ASRU.2007.4430183","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430183","url":null,"abstract":"This paper discusses the challenges of building information retrieval applications that operate on large amounts of voice/audio data. Various problems and issues are presented along with proposed solutions. A set of techniques based on a phonetic keyword spotting approach is presented, together with examples of concrete applications that solve real-life problems.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117317061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised state clustering for stochastic dialog management
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430171
F. Lefèvre, R. Mori
Following recent studies in stochastic dialog management, this paper introduces an unsupervised approach aimed at reducing the cost and complexity of setting up a probabilistic POMDP-based dialog manager. The proposed method is based on a first decoding step that derives basic semantic constituents from user utterances. These isolated units and some relevant context features (such as previous system actions and previous user utterances) are combined to form vectors representing the ongoing dialog states. After a clustering step, each partition of this space is intended to represent a particular dialog state. Any new utterance can then be classified according to these automatic states, and the belief state can be updated before the POMDP-based dialog manager decides on the best next action to perform. The proposed approach is applied to the French MEDIA task (tourist information and hotel booking). The MEDIA 10k-utterance training corpus is semantically rich (over 80 basic concepts) and is segmentally annotated in terms of basic concepts. Before user trials are carried out, some insight into the method's effectiveness is obtained by analyzing the convergence of the POMDP models.
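As a minimal, hypothetical illustration of the two ingredients described above, the sketch below clusters utterance/context vectors with k-means to define automatic dialog states and then performs a discrete POMDP belief update over those states; the vector construction, number of clusters, and the transition/observation estimates are all assumptions rather than the paper's setup.

```python
# (1) k-means clustering of utterance+context vectors into automatic dialog
# states; (2) discrete belief update b'(s') ∝ O(o|s') * sum_s T(s'|s) b(s).
# All quantities here are toy stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.standard_normal((500, 20))          # concept + context features
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vectors)

S = kmeans.n_clusters
T = rng.dirichlet(np.ones(S), size=S)             # T[s, s'] = P(s' | s, a)
O = rng.dirichlet(np.ones(S), size=S)             # O[s', o] = P(o | s', a)

def belief_update(b, obs, T, O):
    b_new = O[:, obs] * (T.T @ b)                 # O(o|s') * sum_s T(s'|s) b(s)
    return b_new / b_new.sum()

belief = np.full(S, 1.0 / S)                      # uniform initial belief
new_utt = rng.standard_normal(20)                 # features of an incoming utterance
obs = int(kmeans.predict(new_utt[None, :])[0])    # map it to an automatic state
belief = belief_update(belief, obs, T, O)
print(belief.round(3))
```

A POMDP dialog manager would then choose its next action from this belief vector, for example via a learned or hand-crafted policy.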
{"title":"Unsupervised state clustering for stochastic dialog management","authors":"F. Lefèvre, R. Mori","doi":"10.1109/ASRU.2007.4430171","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430171","url":null,"abstract":"Following recent studies in stochastic dialog management, this paper introduces an unsupervised approach aiming at reducing the cost and complexity for the setup of a probabilistic POMDP-based dialog manager. The proposed method is based on a first decoding step deriving semantic basic constituents from user utterances. These isolated units and some relevant context features (as previous system actions, previous user utterances...) are combined to form vectors representing the on-going dialog states. After a clustering step, each partition of this space is intented to represent a particular dialog state. Then any new utterance can be classified according to these automatic states and the belief state can be updated before the POMDP-based dialog manager can take a decision on the best next action to perform. The proposed approach is applied to the French media task (tourist information and hotel booking). The media 10k-utterance training corpus is semantically rich (over 80 basic concepts) and is segmentally annotated in terms of basic concepts. Before user trials can be carried out, some insights on the method effectiveness are obtained by analysis of the convergence of the POMDP models.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116309298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech enhancement using PCA and variance of the reconstruction error in distributed speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430077
Amin Haji Abolhassani, S. Selouani, D. O'Shaughnessy
In this paper we present a signal-subspace approach to enhancing a noisy signal. The algorithm is based on principal component analysis (PCA), in which the optimal subspace is selected by a variance of the reconstruction error (VRE) criterion. This choice overcomes many limitations of other selection criteria, such as over-estimation of the signal subspace or the need for empirical parameters. We have also extended the subspace algorithm to handle colored and babble noise. The performance evaluation, carried out on the Aurora database, measures improvements in distributed speech recognition of signals corrupted by different types of additive noise. Our algorithm succeeds in improving the recognition of noisy speech in all noise conditions.
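A minimal sketch of the subspace idea follows: eigendecompose the noisy-data covariance, choose the principal-subspace dimension with a VRE-style score (the unreconstructed variance summed over variables), and project the data onto that subspace. This is a simplified rendering under my own assumptions; it is not the authors' algorithm, and it omits the colored- and babble-noise extension.

```python
# Sketch: PCA-based subspace projection with a VRE-style rank selection.
import numpy as np

def vre_score(cov, eigvecs, l):
    """Unreconstructed-variance score for an l-dimensional principal subspace."""
    dim = cov.shape[0]
    c_res = np.eye(dim) - eigvecs[:, :l] @ eigvecs[:, :l].T   # residual projector
    score = 0.0
    for i in range(dim):
        xi = np.zeros(dim)
        xi[i] = 1.0
        denom = xi @ c_res @ xi
        if denom < 1e-12:              # variable not reconstructable at this rank
            return np.inf
        score += (xi @ c_res @ cov @ c_res @ xi) / denom ** 2
    return score

def enhance(noisy):
    """noisy: (n_frames, dim) matrix; returns projected data and selected rank."""
    mean = noisy.mean(axis=0)
    x = noisy - mean
    cov = x.T @ x / len(x)
    w, v = np.linalg.eigh(cov)
    v = v[:, ::-1]                                            # descending eigenvalues
    scores = [vre_score(cov, v, l) for l in range(1, noisy.shape[1])]
    rank = int(np.argmin(scores)) + 1
    p = v[:, :rank]
    return x @ p @ p.T + mean, rank

rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 12))  # rank-3 "speech"
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
enhanced, rank = enhance(noisy)
print("selected subspace rank:", rank)             # near 3 for this toy data
```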
{"title":"Speech enhancement using PCA and variance of the reconstruction error in distributed speech recognition","authors":"Amin Haji Abolhassani, S. Selouani, D. O'Shaughnessy","doi":"10.1109/ASRU.2007.4430077","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430077","url":null,"abstract":"We present in this paper a signal subspace-based approach for enhancing a noisy signal. This algorithm is based on a principal component analysis (PCA) in which the optimal sub-space selection is provided by a variance of the reconstruction error (VRE) criterion. This choice overcomes many limitations encountered with other selection criteria, like over-estimation of the signal subspace or the need for empirical parameters. We have also extended our subspace algorithm to take into account the case of colored and babble noise. The performance evaluation, which is made on the Aurora database, measures improvements in the distributed speech recognition of noisy signals corrupted by different types of additive noises. Our algorithm succeeds in improving the recognition of noisy speech in all noisy conditions.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"163 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123422421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}