Bootstrapping a spoken language identification system using unsupervised integrated sensing and processing decision trees
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163955
Shuai Huang, D. Karakos, Glen A. Coppersmith, Kenneth Ward Church, S. Siniscalchi
In many inference and learning tasks, collecting large amounts of labeled training data is time consuming and expensive, and oftentimes impractical. Thus, being able to efficiently use small amounts of labeled data with an abundance of unlabeled data—the topic of semi-supervised learning (SSL) [1]—has garnered much attention. In this paper, we look at the problem of choosing these small amounts of labeled data, the first step in a bootstrapping paradigm. Contrary to traditional active learning where an initial trained model is employed to select the unlabeled data points which would be most informative if labeled, our selection has to be done in an unsupervised way, as we do not even have labeled data to train an initial model. We propose using unsupervised clustering algorithms, in particular integrated sensing and processing decision trees (ISPDTs) [2], to select small amounts of data to label and subsequently use in SSL (e.g. transductive SVMs). In a language identification task on the CallFriend and 2003 NIST Language Recognition Evaluation corpora [3], we demonstrate that the proposed method results in significantly improved performance over random selection of equivalently sized training data.
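To make the unsupervised selection step concrete, here is a minimal sketch assuming scikit-learn, with KMeans standing in for the ISPDT clustering (which has no standard library implementation) and a self-training SVM standing in for the transductive SVM; the data and cluster count are toy placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

def select_seed_indices(features, n_clusters=8, seed=0):
    """Pick one point per cluster (the one closest to the centroid) for manual labeling."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    dists = km.transform(features)               # distance of every point to every centroid
    return np.array([dists[:, k].argmin() for k in range(n_clusters)])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                   # toy per-utterance feature vectors
true_labels = rng.integers(0, 4, size=500)       # unknown in practice; only queried for seeds

seed_idx = select_seed_indices(X)
y = np.full(len(X), -1)                          # -1 marks unlabeled points
y[seed_idx] = true_labels[seed_idx]              # label only the selected points

ssl = SelfTrainingClassifier(SVC(probability=True, gamma="scale"))
ssl.fit(X, y)                                    # trains on labeled + unlabeled data together
```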
{"title":"Bootstrapping a spoken language identification system using unsupervised integrated sensing and processing decision trees","authors":"Shuai Huang, D. Karakos, Glen A. Coppersmith, Kenneth Ward Church, S. Siniscalchi","doi":"10.1109/ASRU.2011.6163955","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163955","url":null,"abstract":"In many inference and learning tasks, collecting large amounts of labeled training data is time consuming and expensive, and oftentimes impractical. Thus, being able to efficiently use small amounts of labeled data with an abundance of unlabeled data—the topic of semi-supervised learning (SSL) [1]—has garnered much attention. In this paper, we look at the problem of choosing these small amounts of labeled data, the first step in a bootstrapping paradigm. Contrary to traditional active learning where an initial trained model is employed to select the unlabeled data points which would be most informative if labeled, our selection has to be done in an unsupervised way, as we do not even have labeled data to train an initial model. We propose using unsupervised clustering algorithms, in particular integrated sensing and processing decision trees (ISPDTs) [2], to select small amounts of data to label and subsequently use in SSL (e.g. transductive SVMs). In a language identification task on the CallFriend1 and 2003 NIST Language Recognition Evaluation corpora [3], we demonstrate that the proposed method results in significantly improved performance over random selection of equivalently sized training data.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130928690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust speech recognition using articulatory gestures in a Dynamic Bayesian Network framework
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163918
V. Mitra, Hosung Nam, C. Espy-Wilson
Articulatory Phonology models speech as a spatio-temporal constellation of constricting events (e.g. raising the tongue tip, narrowing the lips), known as articulatory gestures. These gestures are associated with distinct organs (lips, tongue tip, tongue body, velum and glottis) along the vocal tract. In this paper we present a Dynamic Bayesian Network-based speech recognition architecture that models the articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture we performed: (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray microbeam database. Our results indicate that the use of gestural information helps to improve the performance of the recognition system compared to a system using acoustic information only.
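As a rough illustration of inference over a hidden gestural variable, the sketch below runs the standard HMM forward pass for a single discrete gesture stream with made-up parameters; the paper's DBN couples several gesture streams (lips, tongue tip, tongue body, velum, glottis) with word and phone variables, so this is a simplification rather than the authors' model.

```python
import numpy as np

def forward(init, trans, obs_lik):
    """init: (S,), trans: (S, S), obs_lik: (T, S) frame likelihoods per gesture state."""
    T, S = obs_lik.shape
    alpha = np.zeros((T, S))
    alpha[0] = init * obs_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
        alpha[t] /= alpha[t].sum()               # rescale each frame to avoid underflow
    return alpha                                 # filtered posterior over gesture states

init = np.array([0.8, 0.1, 0.1])                 # 3 toy states, e.g. degrees of lip constriction
trans = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
obs_lik = np.abs(np.random.default_rng(0).normal(size=(5, 3)))  # 5 frames of fake likelihoods
print(forward(init, trans, obs_lik))
```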
{"title":"Robust speech recognition using articulatory gestures in a Dynamic Bayesian Network framework","authors":"V. Mitra, Hosung Nam, C. Espy-Wilson","doi":"10.1109/ASRU.2011.6163918","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163918","url":null,"abstract":"Articulatory Phonology models speech as spatio-temporal constellation of constricting events (e.g. raising tongue tip, narrowing lips etc.), known as articulatory gestures. These gestures are associated with distinct organs (lips, tongue tip, tongue body, velum and glottis) along the vocal tract. In this paper we present a Dynamic Bayesian Network based speech recognition architecture that models the articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture we performed: (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray microbeam database. Our results indicate that the use of gestural information helps to improve the performance of the recognition system compared to the system using acoustic information only.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116002496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-site heterogeneous system fusions for the Albayzin 2010 Language Recognition Evaluation
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163961
Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, M. Díez, Germán Bordel, D. M. González, Jesús Antonio Villalba López, A. Miguel, A. Ortega, EDUARDO LLEIDA SOLANO, A. Abad, Oscar Koller, I. Trancoso, Paula Lopez-Otero, Laura Docío Fernández, C. García-Mateo, R. Saeidi, Mehdi Soufifar, T. Kinnunen, T. Svendsen, P. Fränti
The best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resort for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
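A common recipe for this kind of score-level fusion is a linear logistic-regression backend trained on the subsystems' scores; the sketch below illustrates that idea on synthetic scores, assuming scikit-learn, and is not the exact fusion procedure used in the evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_trials = 1000
labels = rng.integers(0, 2, size=n_trials)       # 1 = target language, 0 = non-target
# Synthetic detection scores from three heterogeneous subsystems of varying quality.
scores = np.stack([labels + rng.normal(scale=s, size=n_trials)
                   for s in (0.8, 1.0, 1.3)], axis=1)

fuser = LogisticRegression()                     # learns one weight per subsystem plus a bias
fuser.fit(scores, labels)
fused = fuser.decision_function(scores)          # fused, roughly calibrated scores
print("subsystem weights:", fuser.coef_.ravel())
```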
{"title":"Multi-site heterogeneous system fusions for the Albayzin 2010 Language Recognition Evaluation","authors":"Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, M. Díez, Germán Bordel, D. M. González, Jesús Antonio Villalba López, A. Miguel, A. Ortega, EDUARDO LLEIDA SOLANO, A. Abad, Oscar Koller, I. Trancoso, Paula Lopez-Otero, Laura Docío Fernández, C. García-Mateo, R. Saeidi, Mehdi Soufifar, T. Kinnunen, T. Svendsen, P. Fränti","doi":"10.1109/ASRU.2011.6163961","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163961","url":null,"abstract":"Best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resource for improving performance based on an existing set of systems. Though relative performance gains decrease as larger sets of systems are considered, best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine the best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such a good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126613706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detection of precisely transcribed parts from inexact transcribed corpus
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163989
Kengo Ohta, Masatoshi Tsuchiya, S. Nakagawa
Although large-scale spontaneous speech corpora are a crucial resource for various domains of spoken language processing, they are usually limited in size due to their construction cost, especially the cost of precise transcription. On the other hand, inexactly transcribed corpora such as shorthand notes, meeting records and closed captions are widely available. Unfortunately, it is difficult to use them directly as speech corpora for learning acoustic models, because they contain two kinds of text: precisely transcribed parts and edited parts. In order to resolve this problem, this paper proposes a method for automatically detecting precisely transcribed parts in inexactly transcribed corpora. Our method consists of two steps: the first step is an automatic alignment between the inexact transcription and its corresponding utterance, and the second step is a support vector machine based detector of precisely transcribed parts that uses several features obtained in the first step. Experiments using the Japanese National Diet Record show that automatic detection of precise parts is effective for lightly supervised speaker adaptation, and that it achieves reasonable performance in reducing the cost of converting inexactly transcribed corpora into precisely transcribed ones.
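The second step can be pictured as a binary classifier over alignment-derived features; the sketch below uses an SVM with invented stand-in features (mismatch rate against the inexact text, mean ASR confidence, silence ratio) and a toy labeling rule, not the paper's actual feature set or data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_segments = 400
# Columns: [mismatch rate vs. the inexact transcript, mean ASR confidence, silence ratio]
X = rng.uniform(size=(n_segments, 3))
y = (X[:, 0] < 0.2).astype(int)                  # toy rule: low mismatch ~ precisely transcribed

detector = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
detector.fit(X, y)
precise_mask = detector.predict(X).astype(bool)  # keep only segments flagged as precise
print("kept", int(precise_mask.sum()), "of", n_segments, "segments for adaptation")
```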
{"title":"Detection of precisely transcribed parts from inexact transcribed corpus","authors":"Kengo Ohta, Masatoshi Tsuchiya, S. Nakagawa","doi":"10.1109/ASRU.2011.6163989","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163989","url":null,"abstract":"Although large-scale spontaneous speech corpora are crucial resource for various domains of spoken language processing, they are usually limited due to their construction cost especially in transcribing precisely. On the other hand, inexact transcribed corpora like shorthand notes, meeting records and closed captions are widely available. Unfortunately, it is difficult to use them directly as speech corpora for learning acoustic models, because they contain two kinds of text, precisely transcribed parts and edited parts. In order to resolve this problem, this paper proposes an automatic detection method of precisely transcribed parts from inexact transcribed corpora. Our method consists of two steps: the first step is an automatic alignment between the inexact transcription and its corresponding utterance, and the second step is a support vector machine based detector of precisely transcribed parts using several features obtained by the first step. Experiments using the Japanese National Diet Record shows that automatic detection of precise parts is effective for lightly supervised speaker adaptation, and shows that it achieves reasonable performance to reduce the converting cost from inexact transcribed corpora into precisely transcribed ones.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128848634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evolutionary discriminative speaker adaptation
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163924
S. Selouani
This paper presents a new evolutionary approach that aims at exploring more solutions while simplifying the speaker adaptation process. In this approach, a single global set of transformation parameters is optimized by genetic algorithms using a discriminative objective function. The goal is to achieve accurate speaker adaptation regardless of the amount of available adaptation data. Experiments using the ARPA-RM database demonstrate the effectiveness of the proposed method.
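For readers unfamiliar with the approach, the sketch below is a generic genetic-algorithm loop evolving a single global transform (here just a bias vector) against a toy objective; the paper's discriminative objective function and GA operators are more elaborate than this.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, pop_size, n_gen = 13, 30, 50                # e.g. a 13-dimensional MFCC offset
target = rng.normal(size=dim)                    # stands in for the adaptation-data statistics

def fitness(candidate):
    return -np.sum((candidate - target) ** 2)    # toy objective: higher is better

pop = rng.normal(size=(pop_size, dim))
for _ in range(n_gen):
    scores = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(scores)[-pop_size // 2:]]    # keep the fitter half
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        mask = rng.random(dim) < 0.5                      # uniform crossover
        children.append(np.where(mask, a, b) + rng.normal(scale=0.05, size=dim))  # + mutation
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print("best fitness:", fitness(best))
```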
{"title":"Evolutionary discriminative speaker adaptation","authors":"S. Selouani","doi":"10.1109/ASRU.2011.6163924","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163924","url":null,"abstract":"This paper presents a new evolutionary-based approach that aims at investigating more solutions while simplifying the speaker adaptation process. In this approach, a single global transformation set of parameters is optimized by genetic algorithms using a discriminative objective function. The goal is to achieve accurate speaker adaptation whatever the amount of available adaptive data. Experiments using the ARPA-RM database demonstrate the effectiveness of the proposed method.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128555817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast speaker diarization using a high-level scripting language
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163887
Ekaterina Gonina, G. Friedland, Henry Cook, K. Keutzer
Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50–250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.
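The core computation being accelerated can be illustrated on the CPU with scikit-learn: train per-cluster GMMs and compare a joint model against two separate models to score a potential merge (a likelihood-ratio score; the model-complexity penalty of full BIC is omitted for brevity). The specialization framework and the GPU mapping themselves are not shown here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def merge_score(feats_a, feats_b, n_components=4):
    """Higher score means the two segment clusters look more like the same speaker."""
    both = np.vstack([feats_a, feats_b])
    gmm_a = GaussianMixture(n_components, random_state=0).fit(feats_a)
    gmm_b = GaussianMixture(n_components, random_state=0).fit(feats_b)
    gmm_ab = GaussianMixture(n_components, random_state=0).fit(both)
    # Total log-likelihood of one shared model minus that of two separate models.
    return gmm_ab.score(both) * len(both) - (
        gmm_a.score(feats_a) * len(feats_a) + gmm_b.score(feats_b) * len(feats_b))

rng = np.random.default_rng(4)
same = merge_score(rng.normal(0, 1, (200, 19)), rng.normal(0, 1, (200, 19)))
diff = merge_score(rng.normal(0, 1, (200, 19)), rng.normal(3, 1, (200, 19)))
print("same-speaker score:", round(same, 1), " different-speaker score:", round(diff, 1))
```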
{"title":"Fast speaker diarization using a high-level scripting language","authors":"Ekaterina Gonina, G. Friedland, Henry Cook, K. Keutzer","doi":"10.1109/ASRU.2011.6163887","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163887","url":null,"abstract":"Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50–250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114890158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging large amounts of loosely transcribed corporate videos for acoustic model training
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163912
M. Paulik, P. Panchapagesan
Lightly supervised acoustic model (AM) training has seen a tremendous amount of interest over the past decade. It promises significant cost savings by relying on only small amounts of accurately transcribed speech and large amounts of imperfectly (loosely) transcribed speech. The latter can often be acquired from existing sources, without additional cost. We identify corporate videos as one such source. After reviewing the state of the art in lightly supervised AM training, we describe our efforts on exploiting 977 hours of loosely transcribed corporate videos for AM training. We report strong reductions in word error rate of up to 19.4% over our baseline. We also report initial results for a simple, yet effective scheme to identify a subset of lightly supervised training labels that are more important to the training process.
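A typical selection step in lightly supervised training is to decode each loosely transcribed segment and keep only those whose hypothesis closely matches the loose transcript; the sketch below implements such a filter with a plain edit-distance WER, and is illustrative rather than the authors' exact scheme.

```python
def wer(ref_words, hyp_words):
    """Word error rate via the standard edit-distance dynamic program."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

# (loose transcript, ASR hypothesis) pairs; real segments would come from decoding the videos.
segments = [("we report strong results", "we report strong results"),
            ("the quarterly numbers were good", "numbers were good overall")]
selected = [loose for loose, hyp in segments
            if wer(loose.split(), hyp.split()) < 0.2]   # keep near-matching segments only
print(selected)
```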
{"title":"Leveraging large amounts of loosely transcribed corporate videos for acoustic model training","authors":"M. Paulik, P. Panchapagesan","doi":"10.1109/ASRU.2011.6163912","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163912","url":null,"abstract":"Lightly supervised acoustic model (AM) training has seen a tremendous amount of interest over the past decade. It promises significant cost-savings by relying on only small amounts of accurately transcribed speech and large amounts of imperfectly (loosely) transcribed speech. The latter can often times be acquired from existing sources, without additional cost. We identify corporate videos as one such source. After reviewing the state of the art in lightly supervised AM training, we describe our efforts on exploiting 977 hours of loosely transcribed corporate videos for AM training. We report strong reductions in word error rate of up to 19.4% over our baseline. We also report initial results for a simple, yet effective scheme to identify a subset of lightly supervised training labels that are more important to the training process.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124441121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Decision of response timing for incremental speech recognition with reinforcement learning
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163976
Di Lu, T. Nishimoto, N. Minematsu
In spoken dialog systems, it is important to reduce the delay in generating a response to a user's utterance. We investigate the use of incremental recognition results, which can be obtained from a speech recognition engine before the input utterance ends. To enable the system to respond correctly before the end of the utterance, the incremental results must be used effectively, even though they are not fully reliable. We formulate this problem as a decision-making task in which the system iteratively chooses either to answer based on the observations so far or to wait for the next observation, and we apply reinforcement learning to it. In our experiments, users rated the proposed method highly; it estimates the completion time of a user's utterance from mora-based speech recognition results.
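The wait-or-answer decision can be illustrated with a toy tabular Q-learning loop; the state definition (number of partial results seen), the rewards, and the simulated reliability below are invented for illustration and are not the paper's experimental setup.

```python
import random

random.seed(0)
ACTIONS = ("wait", "answer")
Q = {(s, a): 0.0 for s in range(6) for a in ACTIONS}   # state = # of partial results seen
alpha, gamma, eps = 0.1, 0.95, 0.2

def step(state, action):
    """Simulated environment: later partial results are more reliable, but waiting costs time."""
    if action == "answer":
        p_correct = min(0.3 + 0.15 * state, 0.95)
        reward = (1.0 if random.random() < p_correct else -1.0) - 0.05 * state
        return None, reward                             # episode ends once the system answers
    return min(state + 1, 5), -0.01                     # small cost for waiting one more result

for _ in range(5000):
    s = 0
    while s is not None:
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        target = r if s2 is None else r + gamma * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(6)})   # learned policy
```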
{"title":"Decision of response timing for incremental speech recognition with reinforcement learning","authors":"Di Lu, T. Nishimoto, N. Minematsu","doi":"10.1109/ASRU.2011.6163976","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163976","url":null,"abstract":"In spoken dialog systems, it is important to reduce the delay in generating a response to a user's utterance. We investigate the use of incremental recognition results which can be obtained from a speech recognition engine before the input utterance ends. To enable the system to respond correctly before the end of the utterance, it is desired to utilize the incremental results effectively, although they are not reliable enough. We formulate this problem as a decision making task, in which the system makes choices iteratively either to answer based on previous observations, or to wait until the next observation. The reinforcement learning can be applied to the problem. As the results of experiments, the users highly evaluate the proposed method which estimate completion time of a user's utterance by using the results of speech recognition based on mora units.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130410230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker adaptation with an Exponential Transform
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163923
Daniel Povey, G. Zweig, A. Acero
In this paper we describe a linear transform that we call an Exponential Transform (ET), which integrates aspects of CMLLR, VTLN and STC/MLLT into a single transform with jointly trained components. Its main advantage is that a very small number of speaker-specific parameters is required, thus enabling effective adaptation with small amounts of speaker specific data. Our formulation shares some characteristics of Vocal Tract Length Normalization (VTLN), and is intended as a substitute for VTLN. The key part of the transform is controlled by a single speaker-specific parameter that is analogous to a VTLN warp factor. The transform has non-speaker-specific parameters that are learned from data, and we find that the axis along which male and female speakers differ is automatically learned. The exponential transform has no explicit notion of frequency warping, which makes it applicable in principle to non-standard features such as those derived from neural nets, or when the key axes may not be male-female. Based on our experiments with standard MFCC features, it appears to perform better than conventional VTLN.
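The single speaker-specific parameter enters through a matrix exponential: in the paper's formulation the transform takes roughly the form W_s = D_s exp(t_s A) B, acting on the extended feature [x; 1], with t_s the speaker factor and A, B trained globally. The sketch below only evaluates that expression with random placeholder matrices (not trained parameters), assuming NumPy/SciPy, and sets the per-speaker component D_s to the identity.

```python
import numpy as np
from scipy.linalg import expm

dim = 13                                         # acoustic feature dimension, e.g. MFCCs
rng = np.random.default_rng(5)
A = rng.normal(scale=0.01, size=(dim + 1, dim + 1))
A[-1, :] = 0.0                                   # keep the last row of the extended transform fixed
B = np.eye(dim + 1) + rng.normal(scale=0.01, size=(dim + 1, dim + 1))
B[-1, :] = np.eye(dim + 1)[-1]                   # so that [x; 1] keeps its trailing 1

def apply_exponential_transform(x, t, D=None):
    """Transform one feature vector with speaker factor t (analogous to a VTLN warp factor)."""
    D = np.eye(dim + 1) if D is None else D      # D would be a per-speaker diagonal component
    W = D @ expm(t * A) @ B                      # speaker-specific transform from one scalar t
    return (W @ np.append(x, 1.0))[:dim]         # apply to [x; 1], drop the homogeneous coordinate

x = rng.normal(size=dim)
print(apply_exponential_transform(x, t=0.1))
```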
{"title":"Speaker adaptation with an Exponential Transform","authors":"Daniel Povey, G. Zweig, A. Acero","doi":"10.1109/ASRU.2011.6163923","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163923","url":null,"abstract":"In this paper we describe a linear transform that we call an Exponential Transform (ET), which integrates aspects of CMLLR, VTLN and STC/MLLT into a single transform with jointly trained components. Its main advantage is that a very small number of speaker-specific parameters is required, thus enabling effective adaptation with small amounts of speaker specific data. Our formulation shares some characteristics of Vocal Tract Length Normalization (VTLN), and is intended as a substitute for VTLN. The key part of the transform is controlled by a single speaker-specific parameter that is analogous to a VTLN warp factor. The transform has non-speaker-specific parameters that are learned from data, and we find that the axis along which male and female speakers differ is automatically learned. The exponential transform has no explicit notion of frequency warping, which makes it applicable in principle to non-standard features such as those derived from neural nets, or when the key axes may not be male-female. Based on our experiments with standard MFCC features, it appears to perform better than conventional VTLN.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130576407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regularized subspace Gaussian mixture models for cross-lingual speech recognition
Pub Date: 2011-12-01. DOI: 10.1109/ASRU.2011.6163959
Liang Lu, Arnab Ghoshal, S. Renals
We investigate cross-lingual acoustic modelling for low resource languages using the subspace Gaussian mixture model (SGMM). We assume the presence of acoustic models trained on multiple source languages, and use the global subspace parameters from those models for improved modelling in a target language with limited amounts of transcribed speech. Experiments on the GlobalPhone corpus using Spanish, Portuguese, and Swedish as source languages and German as target language (with 1 hour and 5 hours of transcribed audio) show that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than using those from a single source language. We also show that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WER.
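The state-vector update in an SGMM maximizes a quadratic auxiliary function (roughly g·v - 0.5 v·H·v in the accumulated statistics), so an ℓ1-penalized estimate can be obtained by proximal gradient with soft-thresholding. The sketch below does that on random stand-in statistics; it is meant only to show the sparsifying effect of the penalty, not the paper's estimation procedure.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def estimate_state_vector(H, g, lam, n_iter=200):
    """Maximize g.v - 0.5*v.H.v - lam*||v||_1 by proximal gradient ascent (ISTA)."""
    step = 1.0 / np.linalg.norm(H, 2)            # 1 / Lipschitz constant of the gradient
    v = np.zeros(len(g))
    for _ in range(n_iter):
        v = soft_threshold(v + step * (g - H @ v), step * lam)
    return v

rng = np.random.default_rng(6)
S = rng.normal(size=(40, 40))
H = S @ S.T + np.eye(40)                         # symmetric positive-definite stand-in statistics
g = rng.normal(size=40)
v = estimate_state_vector(H, g, lam=5.0)
print("non-zero dimensions:", int(np.count_nonzero(v)), "of", len(v))
```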
{"title":"Regularized subspace Gaussian mixture models for cross-lingual speech recognition","authors":"Liang Lu, Arnab Ghoshal, S. Renals","doi":"10.1109/ASRU.2011.6163959","DOIUrl":"https://doi.org/10.1109/ASRU.2011.6163959","url":null,"abstract":"We investigate cross-lingual acoustic modelling for low resource languages using the subspace Gaussian mixture model (SGMM). We assume the presence of acoustic models trained on multiple source languages, and use the global subspace parameters from those models for improved modelling in a target language with limited amounts of transcribed speech. Experiments on the GlobalPhone corpus using Spanish, Portuguese, and Swedish as source languages and German as target language (with 1 hour and 5 hours of transcribed audio) show that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than using those from a single source language. We also show that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm help to overcome numerical instabilities and lead to lower WER.","PeriodicalId":338241,"journal":{"name":"2011 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116239947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}