Improving lecture speech summarization using rhetorical information
J. Zhang, R. Chan, Pascale Fung
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430108
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)
We propose a novel method of extractive summarization of lecture speech based on unsupervised learning of its rhetorical structure. We present empirical evidence showing that rhetorical structure is the underlying semantics, which is then rendered in linguistic and acoustic/prosodic forms in lecture speech. We present a first thorough investigation of the relative contribution of linguistic versus acoustic features and show that, at least for lecture speech, what is said is more important than how it is said. We base our experiments on conference speeches and corresponding presentation slides, as the latter are a faithful description of the rhetorical structure of the former. We find that discourse features from broadcast news are not applicable to lecture speech. By using rhetorical structure information in our summarizer, its performance reaches 67.87% ROUGE-L F-measure at 30% compression, surpassing all previously reported results and the 66.47% ROUGE-L F-measure of a baseline summarizer without rhetorical information. We also show that, despite a 29.7% character error rate in speech recognition, extractive summarization performs relatively well, underlining that the spontaneity of lecture speech does not affect its central meaning.
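The ROUGE-L scores reported above are F-measures over the longest common subsequence (LCS) between a system summary and a reference. A minimal sketch of that computation (standard LCS dynamic programming; the example sentences are invented, not from the paper's data):

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(candidate, reference):
    """ROUGE-L F-measure: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

score = rouge_l_f("the lecture covers rhetorical structure",
                  "the lecture discusses rhetorical structure in detail")
```

Here the LCS is four tokens long, giving precision 4/5 and recall 4/7, so the F-measure works out to exactly 2/3.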
Phonological feature based variable frame rate scheme for improved speech recognition
A. Sangwan, J. Hansen
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430177
In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing is not capable of accurately reflecting the dynamics of continuous speech. The proposed VFR scheme, in contrast, adapts the temporal representation of the speech signal by tying the framing strategy to the detected phone class sequence. The phone classes are detected and segmented using appropriately trained phonological features (PFs). In this manner, the proposed scheme is capable of tracking the evolution of speech due to the underlying phonetic content, and of exploiting the non-uniform information flow-rate of speech through a variable framing strategy. The new VFR scheme is applied to automatic speech recognition of the TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results, with relative reductions in WER (word error rate) of 24% and 8% for the TIMIT and NTIMIT tasks, respectively.
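The core idea — a framing strategy tied to the detected phone class sequence — can be sketched as below. The class-to-step mapping is an illustrative assumption, not the paper's actual rates:

```python
# Hypothetical frame-step sizes in ms per broad phone class (assumption:
# fast-changing classes such as stops get denser framing than steady vowels).
STEP_MS = {"vowel": 15, "fricative": 10, "stop": 5, "silence": 20}

def frame_times(segments):
    """segments: list of (phone_class, start_ms, end_ms) from the HLS stage.
    Returns frame start times with a class-dependent frame rate."""
    times = []
    for cls, start, end in segments:
        step = STEP_MS.get(cls, 10)  # default step for unlisted classes
        t = start
        while t < end:
            times.append(t)
            t += step
    return times

frames = frame_times([("silence", 0, 40), ("stop", 40, 60), ("vowel", 60, 120)])
```

The 20 ms stop segment receives four frames while the 40 ms silence receives only two, reflecting the non-uniform information flow-rate the abstract describes.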
Towards robust automatic evaluation of pathologic telephone speech
K. Riedhammer, G. Stemmer, T. Haderlein, M. Schuster, F. Rosanowski, E. Nöth, A. Maier
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430200
For many aspects of speech therapy, an objective evaluation of the intelligibility of a patient's speech is needed. We investigate the evaluation of speech intelligibility by means of automatic speech recognition. Previous studies have shown that measures like word accuracy are consistent with human experts' ratings. To ease the patient's burden, it is highly desirable to conduct the assessment via telephone. However, the telephone channel degrades the quality of the speech signal, which negatively affects the results. To reduce these inaccuracies, we propose a combination of two speech recognizers. Experiments on two sets of pathological speech show that the combination yields consistent improvements in the correlation between the automatic evaluation and the ratings of human experts. Furthermore, the approach reduces the maximum error of the intelligibility measure by 10% and 25% on the two sets.
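The evaluation criterion here is the correlation between recognizer-derived word accuracies and expert ratings. A small sketch of that measurement, averaging two recognizers' word accuracies per speaker before correlating (the numbers are invented for illustration):

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-speaker word accuracies from two recognizers,
# and expert intelligibility ratings for the same speakers.
wa_rec1 = [0.62, 0.55, 0.80, 0.47]
wa_rec2 = [0.58, 0.50, 0.84, 0.45]
combined = [(a + b) / 2 for a, b in zip(wa_rec1, wa_rec2)]
expert = [3.1, 2.6, 4.2, 2.2]
r = pearson(combined, expert)
```

In this toy example the combined accuracies track the expert scores almost perfectly; the paper's contribution is showing that such combination improves the correlation on real pathological telephone speech.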
Semantic translation error rate for evaluating translation systems
Krishna Subramanian, D. Stallard, R. Prasad, S. Saleem, P. Natarajan
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430144
In this paper, we introduce a new metric, which we call the semantic translation error rate, or STER, for evaluating the performance of machine translation systems. STER is based on the previously published translation error rate (TER) (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) metrics. Specifically, STER extends TER in two ways: first, by incorporating the word equivalence measures (WordNet and Porter stemming) standardly used by METEOR, and second, by disallowing alignments of concept words to non-concept words (i.e., stop words). We show how these features make STER alignments better suited for human-driven analysis than standard TER. We also present experimental results showing that STER is better correlated with human judgments than TER. Finally, we compare STER to METEOR and illustrate that METEOR scores computed using STER alignments have statistical properties similar to METEOR scores computed using METEOR alignments.
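The two extensions to TER can be illustrated with a word-equivalence test. The stemmer below is a crude stand-in for Porter stemming and the stop-word list is a toy assumption; the real STER also uses WordNet synonymy, which is omitted here:

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is"}

def naive_stem(word):
    # crude suffix stripping as a stand-in for the Porter stemmer (assumption)
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def ster_equivalent(hyp_word, ref_word):
    """STER-style word equivalence: exact or stem match, but a concept
    (content) word may never align to a stop word, and vice versa."""
    if (hyp_word in STOP_WORDS) != (ref_word in STOP_WORDS):
        return False
    return hyp_word == ref_word or naive_stem(hyp_word) == naive_stem(ref_word)
```

Under this test, "translated" and "translates" align (shared stem), while "the" can never align to "translation", which is exactly the restriction that keeps STER alignments interpretable.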
A fast-match approach for robust, faster than real-time speaker diarization
Yan Huang, Oriol Vinyals, G. Friedland, Christian A. Müller, Nikki Mirghafori, Chuck Wooters
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430196
During the past few years, speaker diarization has achieved satisfying accuracy in terms of speaker Diarization Error Rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative-clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian Information Criterion (BIC). Two strategies, based on the pitch-correlogram and on an unscented-transform-based approximation of the KL-divergence, are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. Using the KL-divergence fast-match strategy, the new system performs only 14% of the BIC comparisons needed by the baseline system and runs 41% faster without affecting the DER. The result is a robust, faster-than-real-time speaker diarization system.
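The fast-match idea — rank merge candidates with a cheap divergence, run expensive BIC only on the closest pairs — can be sketched as follows. For simplicity the sketch uses the closed-form KL divergence between one-dimensional Gaussians; the paper approximates the KL divergence between GMMs via the unscented transform, and the cluster summaries and keep-fraction below are invented:

```python
import math

def gauss_kl(m1, v1, m2, v2):
    # closed-form KL divergence between two 1-D Gaussians N(m1,v1) || N(m2,v2)
    return math.log(math.sqrt(v2 / v1)) + (v1 + (m1 - m2) ** 2) / (2 * v2) - 0.5

def symmetric_kl(c1, c2):
    return gauss_kl(*c1, *c2) + gauss_kl(*c2, *c1)

def fast_match(clusters, keep_fraction=0.25):
    """Rank all cluster pairs by symmetric KL (cheap) and keep only the
    closest fraction as candidates for the expensive BIC comparison."""
    pairs = [(symmetric_kl(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    pairs.sort()
    keep = max(1, int(len(pairs) * keep_fraction))
    return [(i, j) for _, i, j in pairs[:keep]]

# (mean, variance) summaries of four hypothetical speaker clusters
clusters = [(0.0, 1.0), (0.1, 1.1), (5.0, 1.0), (5.2, 0.9)]
candidates = fast_match(clusters)
```

Of the six possible pairs, only the single closest pair survives the fast match, so BIC would be evaluated on a small fraction of candidates — the mechanism behind the reported 14% figure.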
Automatic lexical pronunciations generation and update
Ghinwa F. Choueiter, S. Seneff, James R. Glass
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430113
Most automatic speech recognizers use a dictionary that maps words to one or more canonical pronunciations. Such entries are typically hand-written by lexical experts. In this research, we investigate a new approach for automatically generating lexical pronunciations using a linguistically motivated subword model, and refining the pronunciations with spoken examples. The approach is evaluated on an isolated word recognition task with a 2k lexicon of restaurant and street names. A letter-to-sound model is first used to generate seed baseforms for the lexicon. Then spoken utterances of words in the lexicon are presented to a subword recognizer, and the top hypotheses are used to update the lexical baseforms. The spelling of each word is also used to constrain the subword search space and generate spelling-constrained baseforms. The results obtained are quite encouraging and indicate that our approach can be successfully used to learn valid pronunciations of new words.
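The seed-then-refine loop can be sketched as below. The update rule (majority vote over the recognizer's top hypotheses) is a simplifying assumption, as are the example word and phone strings:

```python
from collections import Counter

def update_baseforms(seed_lexicon, recognized):
    """seed_lexicon: word -> seed pronunciation from a letter-to-sound model.
    recognized: word -> list of top subword-recognizer hypotheses, one per
    spoken example. Replace the seed with the majority hypothesis whenever
    spoken evidence is available; otherwise keep the seed."""
    lexicon = dict(seed_lexicon)
    for word, hyps in recognized.items():
        if hyps:
            lexicon[word] = Counter(hyps).most_common(1)[0][0]
    return lexicon

seed = {"algiers": "ae l jh ih r z"}          # hypothetical L2S seed baseform
hyps = {"algiers": ["ae l jh ih r z",          # subword-recognizer outputs
                    "ah l jh ih r z",
                    "ah l jh ih r z"]}
updated = update_baseforms(seed, hyps)
```

Two of the three spoken examples agree on the first phone, so the seed baseform is overwritten by the evidence-backed variant.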
Combining statistical models with symbolic grammar in parsing
Junichi Tsujii
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430140
There are two streams of research in computational linguistics and natural language processing: the empiricist and rationalist traditions. Theories and computational techniques in these two streams have been developed separately and differ in nature. Although the two traditions have been considered irreconcilable and have often been antagonistic toward each other, I contest this assertion and claim that these two research streams in linguistics, despite or because of their differences, can complement each other and should be combined into a unified methodology. I will demonstrate in my talk that there have been interesting developments in this direction of integration, and I discuss some of the recent results and their implications for engineering applications.
The LIMSI QAst systems: Comparison between human and automatic rules generation for question-answering on speech transcriptions
S. Rosset, Olivier Galibert, G. Adda, Eric Bilinski
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430188
In this paper, we present two different question-answering systems for speech transcripts. Both systems are based on a complete, multi-level analysis of both queries and documents. The first system uses handcrafted rules for the selection of small text fragments (snippets) and for answer extraction. The second replaces the handcrafting with an automatically generated research descriptor; a score based on these descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The preliminary results obtained on QAst (QA on speech transcripts) development data are promising, ranging from 72% correct answers at first rank on manually transcribed meeting data to 94% on manually transcribed lecture data.
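A minimal sketch of proximity-based candidate scoring — counting descriptor terms near a candidate answer position. The window size, tokens, and term sets are illustrative assumptions, and the real system combines this with secondary factors not modeled here:

```python
def proximity_score(tokens, candidate_idx, descriptor_terms, window=10):
    """Score a candidate answer position by how many descriptor terms occur
    within a token window around it (simplified proximity measurement)."""
    lo = max(0, candidate_idx - window)
    hi = min(len(tokens), candidate_idx + window + 1)
    nearby = set(tokens[lo:hi])
    return sum(1 for term in descriptor_terms if term in nearby)

tokens = "the meeting was chaired by alice in the main conference room".split()
score = proximity_score(tokens, tokens.index("alice"), {"meeting", "chaired", "room"})
```

A candidate surrounded by many descriptor terms (here "alice", near all three) outranks one whose context shares nothing with the query's descriptor.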
Deriving salient learners' mispronunciations from cross-language phonological comparisons
H. Meng, Y. Lo, Lan Wang, W. Lau
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430152
This work aims to derive the salient mispronunciations made by Chinese (L1 being Cantonese) learners of English (L2 being American English) in order to support the design of pedagogical and remedial instruction. Our approach is grounded in the theory of language transfer and involves a systematic phonological comparison between the two languages to predict phonetic confusions that may lead to mispronunciations. We collect a corpus of speech recordings from 21 Cantonese learners of English. We develop an automatic speech recognizer by training cross-word triphone models on the TIMIT corpus. We also develop an "extended" pronunciation lexicon that incorporates the predicted phonetic confusions to generate additional, erroneous pronunciation variants for each word. The extended pronunciation lexicon is used to produce a confusion network in recognition of the English speech recordings of the Cantonese learners. We then use the statistics of the erroneous recognition outputs to derive salient mispronunciations that support the predictions of the phonological comparison.
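The extended-lexicon construction can be sketched as follows. The confusion pairs below are illustrative assumptions about L1 transfer, not the paper's actual confusion set, and real baseforms would use a proper phone inventory:

```python
# Hypothetical L1-transfer confusions for Cantonese-accented English
# (illustrative only): canonical phone -> likely substitutions.
CONFUSIONS = {"th": ["f", "d"], "r": ["w"], "v": ["w", "f"]}

def extend_pronunciations(baseform):
    """Expand one space-separated phone string into a set of variants by
    substituting each confusable phone one at a time (single-error variants)."""
    phones = baseform.split()
    variants = {baseform}  # always keep the canonical pronunciation
    for i, phone in enumerate(phones):
        for sub in CONFUSIONS.get(phone, []):
            variants.add(" ".join(phones[:i] + [sub] + phones[i + 1:]))
    return sorted(variants)

variants = extend_pronunciations("th r ee")
```

The canonical form plus its predicted erroneous variants then enter the recognition lexicon, so that the recognizer's choice among them reveals which confusions learners actually produce.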
A language modeling approach to question answering on speech transcripts
Matthias H. Heie, E. Whittaker, Josef R. Novak, S. Furui
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430112
This paper presents a language modeling approach to sentence retrieval for Question Answering (QA) that we used in Question Answering on speech transcripts (QAst), a pilot task at the 2007 Cross Language Evaluation Forum (CLEF) evaluations. A language model (LM) is generated for each sentence, and these models are combined with document LMs to take advantage of contextual information. A query expansion technique using class models is proposed and included in our framework. Finally, our method's impact on exact answer extraction is evaluated. We show that combining sentence LMs with document LMs significantly improves sentence retrieval performance, and that this sentence retrieval approach leads to better answer extraction performance.
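The sentence-LM/document-LM combination can be sketched as linear interpolation of unigram models under the query-likelihood framework. The interpolation weight, floor probability, and example texts are assumptions for illustration:

```python
from collections import Counter

def unigram_lm(tokens):
    # maximum-likelihood unigram model: word -> relative frequency
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, sent_lm, doc_lm, lam=0.5, floor=1e-6):
    """Score a sentence for a query by interpolating its own unigram LM
    with the enclosing document's LM (the document supplies context and
    smooths words the sentence itself lacks)."""
    score = 1.0
    for w in query.split():
        score *= lam * sent_lm.get(w, 0.0) + (1 - lam) * doc_lm.get(w, floor)
    return score

doc = "the model answers questions about speech transcripts".split()
sent_a = "the model answers questions".split()
sent_b = "speech transcripts are noisy".split()
doc_lm = unigram_lm(doc)
score_a = query_likelihood("answers questions", unigram_lm(sent_a), doc_lm)
score_b = query_likelihood("answers questions", unigram_lm(sent_b), doc_lm)
```

Because the document LM contributes mass even for words absent from a sentence, sentence B still gets a nonzero score, but sentence A, which actually contains the query words, is ranked higher.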