The RWTH Arabic-to-English spoken language translation system
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430145
Oliver Bender, E. Matusov, Stefan Hahn, Sasa Hasan, Shahram Khadivi, H. Ney
We present the RWTH phrase-based statistical machine translation system designed for the translation of Arabic speech into English text. This system was used in the Global Autonomous Language Exploitation (GALE) Go/No-Go Translation Evaluation 2007. Using a two-pass approach, we first generate n-best translation candidates and then rerank these candidates using additional models. We give a short review of the decoder as well as of the models used in both passes. We stress the difficulties of spoken language translation, i.e., how to combine the recognition and translation systems and how to compensate for missing punctuation. In addition, we cover our work on domain adaptation for the applied language models. We present translation results for the official GALE 2006 evaluation set and the GALE 2007 development set.
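As a rough illustration of the second-pass reranking step, the sketch below combines per-candidate model scores log-linearly and returns the highest-scoring hypothesis; the feature names and weights are invented placeholders, not the models actually used in the RWTH system.

```python
# Hedged sketch of log-linear n-best reranking; feature names and weights
# are illustrative placeholders, not the paper's models.

def rerank_nbest(candidates, weights):
    """candidates: list of (hypothesis, feature_dict); returns the best hypothesis."""
    def score(features):
        # log-linear combination: weighted sum of model scores
        return sum(weights[name] * value for name, value in features.items())
    best_hyp, _ = max(candidates, key=lambda c: score(c[1]))
    return best_hyp

# toy usage with made-up feature scores (tm = translation model, lm = language model)
nbest = [
    ("the meeting was postponed", {"tm": -12.3, "lm": -20.1, "len": 4}),
    ("the meeting was delayed",   {"tm": -11.8, "lm": -21.4, "len": 4}),
]
weights = {"tm": 1.0, "lm": 0.6, "len": 0.1}
print(rerank_nbest(nbest, weights))
```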
{"title":"The RWTH Arabic-to-English spoken language translation system","authors":"Oliver Bender, E. Matusov, Stefan Hahn, Sasa Hasan, Shahram Khadivi, H. Ney","doi":"10.1109/ASRU.2007.4430145","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430145","url":null,"abstract":"We present the RWTH phrase-based statistical machine translation system designed for the translation of Arabic speech into English text. This system was used in the Global Autonomous Language Exploitation (GALE) Go/No-Go Translation Evaluation 2007. Using a two-pass approach, we first generate n-best translation candidates and then rerank these candidates using additional models. We give a short review of the decoder as well as of the models used in both passes. We stress the difficulties of spoken language translation, i.e. how to combine the recognition and translation systems and how to compensate for missing punctuation. In addition, we cover our work on domain adaptation for the applied language models. We present translation results for the official GALE 2006 evaluation set and the GALE 2007 development set.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126409212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A data-centric architecture for data-driven spoken dialog systems
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430168
S. Varges, G. Riccardi
Data is becoming increasingly crucial for training and (self-) evaluation of spoken dialog systems (SDS). Data is used to train models (e.g. acoustic models) and is then 'forgotten'. Data is generated on-line by the different components of the SDS, e.g. the dialog manager, as well as by the world it interacts with (e.g. news streams, ambient sensors, etc.). The data is used to evaluate and analyze conversational systems both on-line and off-line. We need to be able to query such heterogeneous data for further processing. In this paper we present an approach with two novel components: first, an architecture for SDSs that takes a data-centric view, ensuring persistency and consistency of data as it is generated. The architecture is centered around a database that stores dialog data beyond the lifetime of individual dialog sessions, facilitating dialog mining, annotation, and logging. Second, we take advantage of the statefulness of the data-centric architecture by means of a lightweight, reactive, inference-based dialog manager that is itself stateless. The feasibility of our approach has been validated in a prototype of a phone-based university help-desk application. We detail the SDS architecture and the dialog management, model, and data representation.
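The data-centric idea can be illustrated with a small sketch, assuming a SQLite store and an invented help-desk rule: every turn is persisted beyond the session, and the dialog manager holds no in-memory state, reconstructing context purely by querying the database.

```python
# Sketch: persist dialog turns in a database; the dialog manager keeps no
# in-memory state and derives its next move by querying the store.
import sqlite3

db = sqlite3.connect(":memory:")  # a file path would persist data across sessions
db.execute("""CREATE TABLE IF NOT EXISTS turns (
                session_id TEXT, turn INTEGER, speaker TEXT, utterance TEXT)""")

def log_turn(session_id, speaker, utterance):
    n = db.execute("SELECT COUNT(*) FROM turns WHERE session_id=?",
                   (session_id,)).fetchone()[0]
    db.execute("INSERT INTO turns VALUES (?,?,?,?)",
               (session_id, n, speaker, utterance))

def next_system_move(session_id):
    # stateless: decide from the stored history only (toy rule for illustration)
    rows = db.execute("SELECT utterance FROM turns WHERE session_id=? AND speaker='user' "
                      "ORDER BY turn", (session_id,)).fetchall()
    if not rows:
        return "How can I help you?"
    if "exam" in rows[-1][0]:
        return "Which course's exam schedule do you need?"
    return "Could you rephrase that?"

log_turn("s1", "user", "when is the exam for linguistics 101")
print(next_system_move("s1"))
```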
{"title":"A data-centric architecture for data-driven spoken dialog systems","authors":"S. Varges, G. Riccardi","doi":"10.1109/ASRU.2007.4430168","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430168","url":null,"abstract":"Data is becoming increasingly crucial for training and (self-) evaluation of spoken dialog systems (SDS). Data is used to train models (e.g. acoustic models) and is 'forgotten'. Data is generated on-line from the different components of the SDS system, e.g. the dialog manager, as well as from the world it is interacting with (e.g. news streams, ambient sensors etc.). The data is used to evaluate and analyze conversational systems both on-line and off-line. We need to be able query such heterogeneous data for further processing. In this paper we present an approach with two novel components: first, an architecture for SDSs that takes a data-centric view, ensuring persistency and consistency of data as it is generated. The architecture is centered around a database that stores dialog data beyond the lifetime of individual dialog sessions, facilitating dialog mining, annotation, and logging. Second, we take advantage of the state-fullness of the data-centric architecture by means of a lightweight, reactive and inference-based dialog manager that itself is stateless. The feasibility of our approach has been validated within a prototype of a phone-based university help-desk application. We detail SDS architecture and dialog management, model, and data representation.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121132213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improvements in phone based audio search via constrained match with high order confusion estimates
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430191
U. Chaudhari, M. Picheny
This paper investigates an approximate similarity measure for searching in phone-based audio transcripts. The baseline method combines elements found in the literature to form an approach based on a phonetic confusion matrix that is used to determine the similarity of an audio document and a query, both of which are parsed into phone N-grams. Experimental results show performance comparable to other approaches in the literature. Extensions of the approach are developed based on a constrained form of the similarity measure that takes into account the system-dependent errors that can occur. This is done by accounting for higher-order confusions, namely of phone bi-grams and tri-grams. Results show improved performance across a variety of system configurations.
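A minimal sketch of confusion-weighted phone n-gram matching of the kind described here, with invented confusion probabilities and a deliberately simplified scoring rule; it is not the paper's exact measure.

```python
# Sketch: score a query against a phone transcript by matching phone n-grams,
# weighting mismatches by a (toy) phone confusion matrix.

def ngrams(phones, n):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def ngram_similarity(a, b, confusion):
    # probability that n-gram a was recognized as n-gram b, assuming
    # independent per-phone confusions
    p = 1.0
    for x, y in zip(a, b):
        p *= confusion.get((x, y), 1.0 if x == y else 0.0)
    return p

def score(query_phones, doc_phones, confusion, n=3):
    q, d = ngrams(query_phones, n), ngrams(doc_phones, n)
    # accumulate the best confusion-weighted match for each query n-gram
    return sum(max((ngram_similarity(qg, dg, confusion) for dg in d), default=0.0)
               for qg in q)

conf = {("p", "b"): 0.3, ("s", "z"): 0.4}  # toy confusion entries
print(score(list("kaespr"), list("kaezbr"), conf, n=3))
```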
{"title":"Improvements in phone based audio search via constrained match with high order confusion estimates","authors":"U. Chaudhari, M. Picheny","doi":"10.1109/ASRU.2007.4430191","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430191","url":null,"abstract":"This paper investigates an approximate similarity measure for searching in phone based audio transcripts. The baseline method combines elements found in the literature to form an approach based on a phonetic confusion matrix that is used to determine the similarity of an audio document and a query, both of which are parsed into phone N-grams. Experimental results show comparable performance to other approaches in the literature. Extensions of the approach are developed based on a constrained form of the similarity measure that can take into consideration the system dependent errors that can occur. This is done by accounting for higher order confusions, namely of phone bi-grams and tri-grams. Results show improved performance across a variety of system configurations.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121197364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based learning for phonetic classification
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430138
Andrei Alexandrescu, K. Kirchhoff
We introduce graph-based learning for acoustic-phonetic classification. In graph-based learning, training and test data points are jointly represented in a weighted undirected graph characterized by a weight matrix indicating similarities between different samples. Classification of test samples is achieved by label propagation over the entire graph. Although this learning technique is commonly applied in semi-supervised settings, we show how it can also be used as a postprocessing step to a supervised classifier by imposing additional regularization constraints based on the underlying data manifold. We also present a technique to adapt graph-based learning to large datasets and evaluate our system on a vowel classification task. Our results show that graph-based learning improves significantly over state-of-the-art baselines.
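A small sketch of label propagation over a similarity graph, assuming RBF edge weights and toy two-cluster data; this is the generic semi-supervised form of the technique, not the authors' regularized post-processing variant.

```python
# Sketch of graph-based label propagation: build an RBF similarity graph over
# all points, clamp the labeled seeds, and iteratively propagate labels.
import numpy as np

def label_propagation(X, y_labeled, n_labeled, sigma=1.0, alpha=0.9, iters=50):
    """X: (n, d) features; the first n_labeled rows carry labels y_labeled."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                    # RBF edge weights
    np.fill_diagonal(W, 0.0)
    S = W / W.sum(axis=1, keepdims=True)                  # row-normalized transitions
    n_classes = int(y_labeled.max()) + 1
    Y0 = np.zeros((n, n_classes))
    Y0[np.arange(n_labeled), y_labeled] = 1.0             # clamp the labeled points
    F = Y0.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y0              # propagate, pull back to seeds
    return F.argmax(axis=1)

# toy usage: two Gaussian clusters, one labeled seed per cluster
rng = np.random.default_rng(0)
A, B = rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))
X = np.vstack([A[:1], B[:1], A[1:], B[1:]])               # rows 0 and 1 are the seeds
print(label_propagation(X, y_labeled=np.array([0, 1]), n_labeled=2))
```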
{"title":"Graph-based learning for phonetic classification","authors":"Andrei Alexandrescu, K. Kirchhoff","doi":"10.1109/ASRU.2007.4430138","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430138","url":null,"abstract":"We introduce graph-based learning for acoustic-phonetic classification. In graph-based learning, training and test data points are jointly represented in a weighted undirected graph characterized by a weight matrix indicating similarities between different samples. Classification of test samples is achieved by label propagation over the entire graph. Although this learning technique is commonly applied in semi-supervised settings, we show how it can also be used as a postprocessing step to a supervised classifier by imposing additional regularization constraints based on the underlying data manifold. We also present a technique to adapt graph-based learning to large datasets and evaluate our system on a vowel classification task. Our results show that graph-based learning improves significantly over state-of-the art baselines.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115265774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data selection for speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430173
Yi Wu, Rong Zhang, Alexander I. Rudnicky
This paper presents a strategy for efficiently selecting informative data from large corpora of transcribed speech. We propose to choose data uniformly according to the distribution of some target speech unit (phoneme, word, character, etc.). In our experiments, in contrast to the common belief that "there is no data like more data", we found it possible to select a highly informative subset of the data that produces recognition performance comparable to that of a system using a much larger amount of data. At the same time, our selection process is efficient and fast.
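One way to read the selection criterion is as a greedy search for a subset whose target-unit counts stay close to uniform; the sketch below uses words as the unit and an invented coverage gain, so it is only a simplified stand-in for the paper's method.

```python
# Sketch: greedily pick utterances whose target units are currently
# under-represented, pushing the selected subset toward a uniform unit
# distribution. The gain formula is illustrative, not the paper's.
from collections import Counter

def select_uniform(utterances, budget):
    """utterances: list of token lists; returns indices of selected utterances."""
    counts, selected = Counter(), []
    for _ in range(min(budget, len(utterances))):
        best_idx, best_gain = None, None
        for i, utt in enumerate(utterances):
            if i in selected:
                continue
            # reward units that have been seen rarely so far
            gain = sum(1.0 / (1 + counts[u]) for u in utt) / len(utt)
            if best_gain is None or gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        counts.update(utterances[best_idx])
    return selected

corpus = [["hello", "world"], ["hello", "hello"], ["speech", "data"], ["world", "data"]]
print(select_uniform(corpus, budget=2))
```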
{"title":"Data selection for speech recognition","authors":"Yi Wu, Rong Zhang, Alexander I. Rudnicky","doi":"10.1109/ASRU.2007.4430173","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430173","url":null,"abstract":"This paper presents a strategy for efficiently selecting informative data from large corpora of transcribed speech. We propose to choose data uniformly according to the distribution of some target speech unit (phoneme, word, character, etc). In our experiment, in contrast to the common belief that \"there is no data like more data\", we found it possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data. At the same time, our selection process is efficient and fast.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122551977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust speaker clustering strategies to data source variation for improved speaker diarization
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430121
Kyu Jeong Han, Samuel Kim, Shrikanth S. Narayanan
Agglomerative hierarchical clustering (AHC) has been widely used in speaker diarization systems to classify speech segments in a given data source by speaker identity, but is known not to be robust to data source variation. In this paper, we identify one of the key potential sources of this variability that negatively affects clustering error rate (CER), namely short speech segments, and propose three solutions to tackle this issue. Through experiments on various meeting conversation excerpts, the proposed methods are shown to outperform simple AHC, with relative CER improvements in the range of 17-32%.
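A sketch of the baseline AHC step the paper starts from, using cosine distance between mean segment vectors and a stopping threshold; the segment features and threshold are placeholders, and the paper's three short-segment remedies are not reproduced here.

```python
# Sketch of agglomerative hierarchical clustering of speech segments:
# repeatedly merge the closest pair of clusters until the closest distance
# exceeds a stopping threshold.
import numpy as np

def ahc(segments, stop_dist=0.3):
    clusters = [[i] for i in range(len(segments))]
    feats = [segments[i] for i in range(len(segments))]
    def cos_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    while len(clusters) > 1:
        # find the closest pair of clusters under cosine distance
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: cos_dist(feats[ab[0]], feats[ab[1]]))
        if cos_dist(feats[i], feats[j]) > stop_dist:
            break
        # merge j into i; represent the merged cluster by its mean vector
        members = clusters[i] + clusters[j]
        feats[i] = np.mean([segments[k] for k in members], axis=0)
        clusters[i] = members
        del clusters[j], feats[j]
    return clusters

segs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy segment vectors
print(ahc(segs))  # expected: two clusters, one per "speaker"
```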
{"title":"Robust speaker clustering strategies to data source variation for improved speaker diarization","authors":"Kyu Jeong Han, Samuel Kim, Shrikanth S. Narayanan","doi":"10.1109/ASRU.2007.4430121","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430121","url":null,"abstract":"Agglomerative hierarchical clustering (AHC) has been widely used in speaker diarization systems to classify speech segments in a given data source by speaker identity, but is known to be not robust to data source variation. In this paper, we identify one of the key potential sources of this variability that negatively affects clustering error rate (CER), namely short speech segments, and propose three solutions to tackle this issue. Through experiments on various meeting conversation excerpts, the proposed methods are shown to outperform simple AHC in terms of relative CER improvements in the range of 17-32%.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122630267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OOV detection by joint word/phone lattice alignment
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430159
Hui-Ching Lin, J. Bilmes, D. Vergyri, K. Kirchhoff
We propose a new method for detecting out-of-vocabulary (OOV) words for large vocabulary continuous speech recognition (LVCSR) systems. Our method is based on performing a joint alignment between independently generated word and phone lattices, where the word lattice is aligned via a recognition lexicon. Based on a similarity measure between phones, we can locate highly misaligned regions of time and then mark those regions as candidate OOVs. This novel approach is implemented using the framework of graphical models (GMs), which enables fast, flexible integration of the different scores from the word lattices, the phone lattices, and the similarity measures. We evaluate our method on Switchboard data using RT-04 as the test set. Experimental results show that our approach provides a promising and scalable new way to detect OOVs for LVCSR.
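The core alignment idea can be sketched as a dynamic-programming match between the phone sequence implied by the word hypothesis (via the lexicon) and the independently decoded phones, with poorly scoring stretches flagged as OOV candidates; the phone similarities below are toy values, and the lattice and graphical-model machinery is not shown.

```python
# Sketch: align lexicon phones against decoded phones with dynamic programming
# and use a length-normalized score; low scores mark OOV candidate regions.
import numpy as np

def align_score(word_phones, decoded_phones, sim):
    """Needleman-Wunsch-style alignment; returns a length-normalized score."""
    n, m = len(word_phones), len(decoded_phones)
    gap = -1.0
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * gap
    D[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim.get((word_phones[i - 1], decoded_phones[j - 1]),
                        1.0 if word_phones[i - 1] == decoded_phones[j - 1] else -0.5)
            D[i, j] = max(D[i - 1, j - 1] + s, D[i - 1, j] + gap, D[i, j - 1] + gap)
    return D[n, m] / max(n, m)

sim = {("t", "d"): 0.5, ("s", "z"): 0.5}                 # toy phone similarities
print(align_score(list("kat"), list("kad"), sim))        # high -> in-vocabulary
print(align_score(list("kat"), list("muz"), sim))        # low  -> OOV candidate
```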
{"title":"OOV detection by joint word/phone lattice alignment","authors":"Hui-Ching Lin, J. Bilmes, D. Vergyri, K. Kirchhoff","doi":"10.1109/ASRU.2007.4430159","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430159","url":null,"abstract":"We propose a new method for detecting out-of-vocabulary (OOV) words for large vocabulary continuous speech recognition (LVCSR) systems. Our method is based on performing a joint alignment between independently generated word and phone lattices, where the word-lattice is aligned via a recognition lexicon. Based on a similarity measure between phones, we can locate highly mis-aligned regions of time, and then specify those regions as candidate OOVs. This novel approach is implemented using the framework of graphical models (GMs), which enable fast flexible integration of different scores from word lattices, phone lattices, and the similarity measures. We evaluate our method on switchboard data using RT-04 as test set. Experimental results show that our approach provides a promising and scalable new way to detect OOV for LVCSR.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122846032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint decoding of multiple speech patterns for robust speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430090
N.U. Nair, T. Sreenivas
We address the new problem of improving automatic speech recognition performance given multiple utterances of patterns from the same class. We formulate the problem of jointly decoding K multiple patterns given a single hidden Markov model. It is shown that such a solution is possible by aligning the K patterns using the proposed multi-pattern dynamic time warping algorithm, followed by the constrained multi-pattern Viterbi algorithm. The new formulation is tested in the context of speaker-independent isolated word recognition for both clean and noisy patterns. When 10 percent of the speech is affected by burst noise at a -5 dB signal-to-noise ratio (local), it is shown that joint decoding using only two noisy patterns reduces the noisy speech recognition error rate to about 51 percent of that of single-pattern decoding with the Viterbi algorithm. In contrast, a simple maximization of the individual pattern likelihoods provides only about a 7 percent reduction in error rate.
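A sketch of the basic two-pattern dynamic time warping alignment that underlies the approach, using Euclidean frame distances on placeholder features; the constrained multi-pattern Viterbi decoding itself is not shown.

```python
# Sketch: align two utterances of the same word with dynamic time warping,
# returning the accumulated cost and the warping path.
import numpy as np

def dtw(x, y):
    """x, y: (T, d) feature sequences; returns (alignment cost, warping path)."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])        # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the warping path from the end of both sequences
    path, i, j = [], T1, T2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: D[p])
    return D[T1, T2], path[::-1]

# toy usage: two noisy versions of the same underlying trajectory
t = np.linspace(0, 1, 30)
x = np.sin(2 * np.pi * t)[:, None] + 0.05 * np.random.randn(30, 1)
y = np.sin(2 * np.pi * t)[:, None] + 0.05 * np.random.randn(30, 1)
cost, path = dtw(x, y)
print(cost, len(path))
```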
{"title":"Joint decoding of multiple speech patterns for robust speech recognition","authors":"N.U. Nair, T. Sreenivas","doi":"10.1109/ASRU.2007.4430090","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430090","url":null,"abstract":"We are addressing a new problem of improving automatic speech recognition performance, given multiple utterances of patterns from the same class. We have formulated the problem of jointly decoding K multiple patterns given a single hidden Markov model. It is shown that such a solution is possible by aligning the K patterns using the proposed multi pattern dynamic time warping algorithm followed by the constrained multi pattern Viterbi algorithm. The new formulation is tested in the context of speaker independent isolated word recognition for both clean and noisy patterns. When 10 percent of speech is affected by a burst noise at -5 dB signal to noise ratio (local), it is shown that joint decoding using only two noisy patterns reduces the noisy speech recognition error rate to about 51 percent, when compared to the single pattern decoding using the Viterbi Algorithm. In contrast a simple maximization of individual pattern likelihoods, provides only about 7 percent reduction in error rate.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130104773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Broad phonetic class recognition in a Hidden Markov model framework using extended Baum-Welch transformations
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430129
Tara N. Sainath, D. Kanevsky, B. Ramabhadran
In many pattern recognition tasks, given some input data and a model, a probabilistic likelihood score is often computed to measure how well the model describes the data. Extended Baum-Welch (EBW) transformations are most commonly used as a discriminative technique for estimating the parameters of Gaussian mixtures, though recently they have been used to derive a gradient steepness measurement that evaluates how well the model matches the distribution of the data. In this paper, we explore applying the EBW gradient steepness metric in the context of Hidden Markov Models (HMMs) for the recognition of broad phonetic classes, and we present a detailed analysis and results for this gradient metric on the TIMIT corpus. We find that our gradient metric outperforms the baseline likelihood method and offers improvements in noisy conditions.
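As a hedged illustration of the kind of quantity involved, the sketch below applies an EBW-style damped mean update to a single diagonal Gaussian and uses the resulting likelihood gain as a crude steepness proxy; this is not the paper's exact metric, and the damping constant D is an arbitrary choice.

```python
# Sketch: EBW-style damped mean update for a diagonal Gaussian, with the
# likelihood gain after one update used as a rough "steepness" proxy.
import numpy as np

def log_likelihood(x, mu, var):
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)))

def ebw_mean_update(x, mu, D):
    # mu_new = (sum_t x_t + D * mu) / (T + D): larger D keeps the update smaller
    return (x.sum(axis=0) + D * mu) / (len(x) + D)

def steepness(x, mu, var, D=10.0):
    mu_new = ebw_mean_update(x, mu, D)
    return log_likelihood(x, mu_new, var) - log_likelihood(x, mu, var)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, (50, 4))
model_a = (np.zeros(4), np.ones(4))        # matches the data well -> small gain
model_b = (np.full(4, 2.0), np.ones(4))    # mismatched model     -> larger gain
print(steepness(data, *model_a), steepness(data, *model_b))
```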
{"title":"Broad phonetic class recognition in a Hidden Markov model framework using extended Baum-Welch transformations","authors":"Tara N. Sainath, D. Kanevsky, B. Ramabhadran","doi":"10.1109/ASRU.2007.4430129","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430129","url":null,"abstract":"In many pattern recognition tasks, given some input data and a model, a probabilistic likelihood score is often computed to measure how well the model describes the data. Extended Baum-Welch (EBW) transformations are most commonly used as a discriminative technique for estimating parameters of Gaussian mixtures, though recently they have been used to derive a gradient steepness measurement to evaluate the quality of the model to match the distribution of the data. In this paper, we explore applying the EBW gradient steepness metric in the context of Hidden Markov Models (HMMs) for recognition of broad phonetic classes and present a detailed analysis and results on the use of this gradient metric on the TIMIT corpus. We find that our gradient metric is able to outperform the baseline likelihood method, and offers improvements in noisy conditions.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125305817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empirical study of neural network language models for Arabic speech recognition
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430100
Ahmad Emami, L. Mangu
In this paper we investigate the use of neural network language models for Arabic speech recognition. By using a distributed representation of words, the neural network model allows for more robust generalization and is better able to combat the data sparseness problem. We investigate different configurations of the neural probabilistic model, experimenting with the N-gram order, output vocabulary, normalization method, model size, and other parameters. Experiments were carried out on Arabic broadcast news and broadcast conversations data, and the optimized neural network language models showed significant improvements over the baseline N-gram model.
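A sketch of the forward pass of a Bengio-style feedforward N-gram neural language model, the family of models studied here: the N-1 history words are embedded, concatenated, passed through a tanh hidden layer, and mapped to a softmax over the output vocabulary. All sizes and the random weights are illustrative.

```python
# Sketch: forward pass of a feedforward N-gram neural LM with a distributed
# word representation (untrained weights, illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
V, d, h, order = 1000, 64, 128, 4          # vocab, embedding dim, hidden dim, N-gram order

E = rng.normal(0, 0.1, (V, d))             # word embedding table
W1 = rng.normal(0, 0.1, ((order - 1) * d, h))
b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, V))
b2 = np.zeros(V)

def ngram_lm_prob(history_ids, next_id):
    """P(next word | previous N-1 words) under the (untrained) model."""
    x = np.concatenate([E[i] for i in history_ids])      # distributed history representation
    hidden = np.tanh(x @ W1 + b1)
    logits = hidden @ W2 + b2
    logits -= logits.max()                               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()        # softmax normalization
    return probs[next_id]

print(ngram_lm_prob([12, 57, 301], next_id=5))
```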
{"title":"Empirical study of neural network language models for Arabic speech recognition","authors":"Ahmad Emami, L. Mangu","doi":"10.1109/ASRU.2007.4430100","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430100","url":null,"abstract":"In this paper we investigate the use of neural network language models for Arabic speech recognition. By using a distributed representation of words, the neural network model allows for more robust generalization and is better able to fight the data sparseness problem. We investigate different configurations of the neural probabilistic model, experimenting with such parameters as N-gram order, output vocabulary, normalization method, and model size and parameters. Experiments were carried out on Arabic broadcast news and broadcast conversations data and the optimized neural network language models showed significant improvements over the baseline N-gram model.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128802748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}