Latest publications: 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)

Call classification for automated troubleshooting on large corpora
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430110
Keelan Evanini, David Suendermann-Oeft, R. Pieraccini
This paper compares six algorithms for call classification in the framework of a dialog system for automated troubleshooting. The comparison is carried out on large datasets, each consisting of over 100,000 utterances from two domains: television (TV) and Internet (INT). In spite of the high number of classes (79 for TV and 58 for INT), the best classifier (maximum entropy on word bigrams) achieved more than 77% classification accuracy on the TV dataset and 81% on the INT dataset.
Citations: 19
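
The best system in the abstract above is a maximum-entropy classifier over word bigrams. A toy sketch of that setup with scikit-learn, where the utterances, class labels, and test query are all hypothetical (multinomial logistic regression is the maximum-entropy model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical troubleshooting utterances and call classes.
utterances = [
    "my tv screen is black",
    "the tv has no picture",
    "internet is very slow",
    "my internet connection keeps dropping",
]
labels = ["TV_NoPicture", "TV_NoPicture", "INT_Slow", "INT_Disconnect"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),   # word-bigram features, as in the paper
    LogisticRegression(max_iter=1000),     # multinomial logistic = maximum entropy
)
clf.fit(utterances, labels)
print(clf.predict(["tv shows no picture"])[0])
```

On the actual corpora the paper reports over 77% (TV) and 81% (INT) accuracy with this type of model, despite 79 and 58 classes respectively.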
Comparing one and two-stage acoustic modeling in the recognition of emotion in speech
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430180
Björn Schuller, Bogdan Vlasenko, Ricardo Minguez, G. Rigoll, A. Wendemuth
In the search for a standard unit for use in recognition of emotion in speech, a whole turn, that is, the full stretch of speech by one person in a conversation, is common. Within applications such turns often seem favorable. Yet, sub-turn entities are known to be highly effective. In this respect a two-stage approach is investigated that provides higher temporal resolution by chunking speech turns according to acoustic properties, followed by multi-instance learning for turn mapping after individual chunk analysis. For chunking, fast pre-segmentation into emotionally quasi-stationary segments is performed by one-pass Viterbi beam search with token passing based on MFCCs. Chunk analysis is realized by brute-force construction of a large feature space with subsequent subset selection, SVM classification, and speaker normalization. Extensive tests reveal differences compared to one-stage processing. Alternatively, syllables are used for chunking.
Citations: 44
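
The second stage above maps chunk-level decisions back to a turn-level label. As a minimal stand-in for the paper's multi-instance turn-mapping, one can pool per-chunk emotion posteriors over the turn; the emotion set and posterior values below are made up:

```python
import numpy as np

EMOTIONS = ["neutral", "angry", "happy"]

def turn_label(chunk_posteriors):
    """chunk_posteriors: (num_chunks, num_emotions), rows summing to 1."""
    pooled = chunk_posteriors.mean(axis=0)   # pool chunk decisions over the turn
    return EMOTIONS[int(pooled.argmax())]

# Three quasi-stationary chunks of one turn (hypothetical posteriors).
chunks = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
])
print(turn_label(chunks))  # mean posterior [0.3, 0.6, 0.1] -> "angry"
```

The point of the two-stage design is visible even in this sketch: the first chunk alone would be labeled "neutral", while the pooled turn decision is "angry".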
Lattice-based Viterbi decoding techniques for speech translation
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430143
G. Saon, M. Picheny
We describe a cardinal-synchronous Viterbi decoder for statistical phrase-based machine translation which can operate on general ASR lattices (as opposed to confusion networks). The decoder implements constrained source reordering on the input lattice and makes use of an outbound distortion model to score the possible reorderings. The phrase table, representing the decoding search space, is encoded as a weighted finite state acceptor which is determined and minimized. At a high level, the search proceeds by performing simultaneous transitions in two pairs of automata: (input lattice, phrase table FSM) and (phrase table FSM, target language model). An alternative decoding strategy that we explore is to break the search into two independent subproblems: first, we perform monotone lattice decoding and find the best foreign path through the ASR lattice, and then we decode this path with reordering using standard sentence-based SMT. We report experimental results on several test sets of a large-scale Arabic-to-English speech translation task in the context of the global autonomous language exploitation (or GALE) DARPA project. The results indicate that, for monotone search, lattice-based decoding outperforms 1-best decoding, whereas for search with reordering, only the second decoding strategy was found to be superior to 1-best decoding. In both cases, the improvements hold only for shallow lattices.
Citations: 15
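
The alternative strategy described above first runs monotone decoding to find the best path through the ASR lattice. Its core primitive is a best-path search over a weighted acyclic word lattice, sketched here with a hypothetical three-word lattice whose costs stand in for combined acoustic/LM scores:

```python
def lattice_best_path(arcs, start, end):
    """arcs: (src, dst, word, cost) tuples, listed so every src state is
    settled before it is used (topological order). Returns (cost, words)."""
    best = {start: (0.0, [])}
    for src, dst, word, cost in arcs:
        if src not in best:
            continue
        c = best[src][0] + cost
        if dst not in best or c < best[dst][0]:
            best[dst] = (c, best[src][1] + [word])
    return best[end]

# Hypothetical ASR lattice: states 0..3, arcs carry (word, negative-log cost).
arcs = [
    (0, 1, "he", 0.3), (0, 1, "the", 0.9),
    (1, 2, "said", 0.2), (1, 3, "said", 0.8),
    (2, 3, "so", 0.4),
]
cost, words = lattice_best_path(arcs, 0, 3)
print(words, cost)  # ['he', 'said', 'so'] at total cost 0.9
```

The paper's decoders go well beyond this primitive (phrase-table FSM composition, distortion scoring, constrained reordering), but the lattice path search is the step that 1-best decoding skips.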
Discriminative training of multi-state barge-in models
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430137
A. Ljolje, Vincent Goffin
A barge-in system designed to reflect the design of the acoustic model used in commercial applications has been built and evaluated. It uses standard hidden Markov model structures, cepstral features and multiple hidden Markov models for both the speech and non-speech parts of the model. It is tested on a large number of real-world databases using noisy speech onset positions which were determined by forced alignment of lexical transcriptions with the recognition model. The ML trained model achieves low false rejection rates at the expense of high false acceptance rates. The discriminative training using the modified algorithm based on the maximum mutual information criterion reduces the false acceptance rates by a half, while preserving the low false rejection rates. Combining an energy based voice activity detector with the hidden Markov model based barge-in models achieves the best performance.
Citations: 3
Variational Kullback-Leibler divergence for Hidden Markov models
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430132
J. Hershey, P. Olsen, Steven J. Rennie
Divergence measures are widely used tools in statistics and pattern recognition. The Kullback-Leibler (KL) divergence between two hidden Markov models (HMMs) would be particularly useful in the fields of speech and image recognition. Whereas the KL divergence is tractable for many distributions, including Gaussians, it is not in general tractable for mixture models or HMMs. Recently, variational approximations have been introduced to efficiently compute the KL divergence and Bhattacharyya divergence between two mixture models, by reducing them to the divergences between the mixture components. Here we generalize these techniques to approach the divergence between HMMs using a recursive backward algorithm. Two such methods are introduced, one of which yields an upper bound on the KL divergence, the other of which yields a recursive closed-form solution. The KL and Bhattacharyya divergences, as well as a weighted edit-distance technique, are evaluated for the task of predicting the confusability of pairs of words.
Citations: 22
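
For reference, the mixture-level variational approximation that the paper generalizes to HMMs reduces the KL divergence between two Gaussian mixtures to closed-form divergences between their components. A sketch for 1-D Gaussian mixtures; the two example mixtures below are made up:

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def kl_variational(wf, mf, sf, wg, mg, sg):
    """Variational approximation of KL(f || g) for GMMs
    f = (weights wf, means mf, stds sf) and g = (wg, mg, sg)."""
    total = 0.0
    for a in range(len(wf)):
        # Component-level divergences replace the intractable mixture integral.
        num = sum(wf[b] * np.exp(-kl_gauss(mf[a], sf[a], mf[b], sf[b]))
                  for b in range(len(wf)))
        den = sum(wg[b] * np.exp(-kl_gauss(mf[a], sf[a], mg[b], sg[b]))
                  for b in range(len(wg)))
        total += wf[a] * np.log(num / den)
    return total

f = ([0.5, 0.5], [0.0, 3.0], [1.0, 1.0])  # hypothetical mixture f
g = ([0.5, 0.5], [0.5, 3.5], [1.0, 1.0])  # hypothetical mixture g
print(kl_variational(*f, *g))  # small positive value; exactly 0 when f == g
```

The paper's contribution is to extend this component-level reduction from mixtures to HMM state sequences via a recursive backward algorithm.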
Efficient use of overlap information in speaker diarization
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430194
Scott Otterson, Mari Ostendorf
Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear if overlaps are problematic for speaker clustering and/or if errors could be addressed by assigning multiple labels in overlap regions. In this paper, we look at these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of these problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, the use of cross-correlation features with MFCCs reduces the performance gap due to overlaps, so that there is little gain from removing overlapped regions before clustering.
Citations: 43
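
The simple strategy mentioned above assigns speaker labels in overlap regions from the neighboring segments. A sketch with hypothetical diarization segments, where "OVL" marks a detected overlap region:

```python
def resolve_overlaps(segments):
    """segments: list of (start, end, label), "OVL" marking detected overlaps.
    Returns (start, end, speaker-set), overlaps inheriting neighbor labels."""
    out = []
    for i, (s, e, lab) in enumerate(segments):
        if lab != "OVL":
            out.append((s, e, {lab}))
            continue
        speakers = set()
        if i > 0:
            speakers.add(segments[i - 1][2])
        if i + 1 < len(segments):
            speakers.add(segments[i + 1][2])
        speakers.discard("OVL")   # ignore adjacent overlap regions
        out.append((s, e, speakers))
    return out

# Hypothetical segmentation (times in seconds).
segs = [(0.0, 2.1, "spk1"), (2.1, 2.8, "OVL"), (2.8, 5.0, "spk2")]
print(resolve_overlaps(segs))  # the overlap gets {"spk1", "spk2"}
```

With oracle overlap detection, the paper finds this assignment rule alone worth up to 15% relative diarization improvement.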
Speech recognition with localized time-frequency pattern detectors
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430135
K. Schutte, James R. Glass
A method for acoustic modeling of speech is presented which is based on learning and detecting the occurrence of localized time-frequency patterns in a spectrogram. A boosting algorithm is applied to both build classifiers and perform feature selection from a large set of features derived by filtering spectrograms. Initial experiments are performed to discriminate digits in the Aurora database. The system succeeds in learning sequences of localized time-frequency patterns which are highly interpretable from an acoustic-phonetic viewpoint. While the work and the results are preliminary, they suggest that pursuing these techniques further could lead to new approaches to acoustic modeling for ASR which are more noise robust and offer better encoding of temporal dynamics than typical features such as frame-based cepstra.
Citations: 14
The IBM 2007 speech transcription system for European parliamentary speeches
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430158
B. Ramabhadran, O. Siohan, A. Sethy
TC-STAR is a European Union-funded speech-to-speech translation project to transcribe, translate, and synthesize European Parliamentary Plenary Speeches (EPPS). This paper describes IBM's English speech recognition system submitted to the TC-STAR 2007 evaluation. Language model adaptation based on clustering and data selection using relative entropy minimization provided significant gains in the 2007 evaluation. The additional advances over the 2006 system presented in this paper include unsupervised training of acoustic and language models, and a system architecture based on cross-adaptation across complementary systems and system combination through generation of an ensemble of systems using randomized decision-tree state-tying. These advances reduced the error rate by 30% relative to the best-performing system in the TC-STAR 2006 evaluation on the 2006 English development and evaluation test sets, and produced one of the best-performing systems in the 2007 English evaluation, with a word error rate of 7.1%.
Citations: 43
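
One ingredient above is language-model data selection by relative entropy minimization. A toy greedy rendition over unigram distributions, where the in-domain text and candidate pool are invented and the real system operates at far larger scale with richer statistics: at each step, keep the candidate sentence whose addition most reduces the KL divergence to the in-domain distribution.

```python
from collections import Counter
import math

def kl(p, q, vocab, eps=1e-6):
    """KL(p || q) over vocab, flooring missing probabilities at eps."""
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab)

def unigram(counts):
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()} if n else {}

# Invented in-domain text and candidate pool.
in_domain = Counter("the parliament adopted the report".split())
pool = ["the report was adopted", "cats like fish", "parliament met today"]

vocab, p = set(in_domain), unigram(in_domain)
selected, counts = [], Counter()
for _ in range(2):  # greedily pick two sentences
    best = min(pool, key=lambda s: kl(p, unigram(counts + Counter(s.split())), vocab))
    pool.remove(best)
    counts += Counter(best.split())
    selected.append(best)
print(selected)  # the two parliament-like sentences; "cats like fish" is left out
```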
Design and implementation of a robot audition system for automatic speech recognition of simultaneous speech
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430093
S. Yamamoto, K. Nakadai, Mikio Nakano, H. Tsujino, J. Valin, Kazunori Komatani, T. Ogata, HIroshi G. Okuno
This paper addresses robot audition that can cope in real time with speech that has a low signal-to-noise ratio (SNR), using robot-embedded microphones. To cope with such noise, we exploit two key ideas: preprocessing, consisting of sound source localization and separation with a microphone array, and system integration based on missing feature theory (MFT). Preprocessing improves the SNR of a target sound signal using geometric source separation with a multichannel post-filter. MFT uses only reliable acoustic features in speech recognition and masks unreliable parts caused by errors in preprocessing. MFT thus provides smooth integration between preprocessing and automatic speech recognition. A real-time robot audition system based on these two key ideas is constructed for Honda ASIMO and Humanoid SIG2 with 8-ch microphone arrays. The paper also reports the improvement of ASR performance when using two and three simultaneous speech signals.
Citations: 24
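
Missing feature theory, as used above, scores only the reliable acoustic features and masks dimensions corrupted by preprocessing errors. A diagonal-Gaussian sketch where the feature vector, mask, and model parameters are all made up:

```python
import numpy as np

def mft_loglik(x, mask, mean, var):
    """Log-likelihood of x under N(mean, diag(var)), using dims where mask==1."""
    m = mask.astype(bool)
    d = x[m] - mean[m]
    return float(-0.5 * np.sum(np.log(2 * np.pi * var[m]) + d**2 / var[m]))

x    = np.array([0.1, 5.0, -0.2])   # dim 1 is badly corrupted ...
mask = np.array([1, 0, 1])          # ... so it is marked unreliable
mean = np.zeros(3)
var  = np.ones(3)

full   = mft_loglik(x, np.ones(3, dtype=int), mean, var)
masked = mft_loglik(x, mask, mean, var)
print(masked > full)  # True: masking the corrupted dim keeps the score sane
```

In the actual system the reliability mask comes from the separation stage, which is what gives the "smooth integration" between preprocessing and recognition described in the abstract.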
Robust speech recognition using noise suppression based on multiple composite models and multi-pass search
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430083
T. Jitsuhiro, T. Toriyama, K. Kogure
This paper presents robust speech recognition using a noise suppression method based on multiple composite models and multi-pass search. In real environments, many kinds of noise signals exist, and the input speech of speech recognition systems includes them. Our task in the E-Nightingale project is speech recognition of voice memoranda spoken by nurses during actual work at hospitals. To obtain good recognition candidates, it is important to suppress many kinds of noise signals at once to find the target speech. First, before noise suppression, to find speech and noise label sequences, we introduce multi-pass search with acoustic models that include many kinds of noise models and their compositions, their n-gram models, and their lexicon. Second, model-based noise suppression is performed using the multiple composite models selected by the recognized label sequences with time alignments. We evaluated this approach on the E-Nightingale task, and the proposed method outperformed the conventional method.
Citations: 6