2011 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Articles

Designing text corpus using phone-error distribution for acoustic modeling

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163929

H. Murakami, K. Shinoda, S. Furui

It is expensive to prepare a sufficient amount of training data for acoustic modeling for developing large vocabulary continuous speech recognition systems. This is a serious problem especially for resource-deficient languages. We propose an active learning method that effectively reduces the amount of training data without any degradation in recognition performance. It is used to design a text corpus for read speech collection. It first estimates phone-error distribution using a small amount of fully transcribed speech data. Second, it constructs a sentence set whose phone-occurrence distribution is close to the phone-error distribution and collects its speech data. It then extends this process to diphones and triphones and collects more speech data. We evaluated our method with simulation experiments using the Corpus of Spontaneous Japanese. It required only 76 h of speech data to achieve word accuracy of 74.7%, while the conventional training method required 152 h of data to achieve the same rate.
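The sentence-selection step described in this abstract can be illustrated with a small sketch. The abstract does not say how closeness between the phone-occurrence and phone-error distributions is measured or how the sentence set is searched, so the KL-divergence criterion, the greedy search, and the names `kl_divergence` and `select_sentences` below are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

def kl_divergence(target, counts, eps=1e-9):
    """KL(target || empirical), where the empirical distribution comes from smoothed phone counts."""
    total = sum(counts.values())
    kl = 0.0
    for phone, p in target.items():
        if p <= 0:
            continue
        q = (counts.get(phone, 0) + eps) / (total + eps * len(target))
        kl += p * math.log(p / q)
    return kl

def select_sentences(candidates, target, n_select):
    """Greedily pick sentences (lists of phone symbols) whose pooled
    phone counts move closest to the target distribution.
    Assumption: greedy search with a KL criterion; the paper may differ."""
    selected, pooled = [], Counter()
    remaining = list(candidates)
    for _ in range(min(n_select, len(remaining))):
        best = min(remaining,
                   key=lambda s: kl_divergence(target, pooled + Counter(s)))
        selected.append(best)
        pooled.update(best)
        remaining.remove(best)
    return selected

# Hypothetical phone-error distribution and candidate sentences.
phone_error_dist = {"a": 0.5, "k": 0.5}
sentences = [["a", "a", "a", "a"], ["a", "k"], ["k", "k", "k", "k"]]
print(select_sentences(sentences, phone_error_dist, 1))
```

In this toy setting the balanced sentence is chosen first because its phone counts already match the target. Extending the idea to diphones or triphones would only change what the symbols in each sentence represent.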

Citations: 2

Blind noise suppression for Non-Audible Murmur recognition with stereo signal processing

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163981

Shunta Ishii, T. Toda, H. Saruwatari, S. Sakti, Satoshi Nakamura

In this paper, we propose a blind noise suppression method for Non-Audible Murmur (NAM) recognition. NAM is a very soft whispered voice detected with a NAM microphone, which is one type of body-conductive microphone. Due to its recording mechanism, the detected signal suffers from noise caused by the speaker's movements. In the proposed method, which uses a stereo signal detected with two NAM microphones, the noise is estimated with blind source separation, and spectral subtraction is then performed in each channel to reduce the noise. Moreover, channel selection is performed frame by frame to generate a less distorted monaural NAM signal. Experimental results show that 1) word accuracy in large vocabulary continuous NAM recognition is degraded from 69.2% to 53.6% by the noise and 2) it is significantly recovered to 63.3% in a simulated situation and 58.6% in a real situation with the proposed method.
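As a rough illustration of the last two stages named in this abstract, the sketch below applies magnitude-domain spectral subtraction to each frame and then picks one channel per frame. The blind source separation stage is not implemented: the noise magnitude estimates are assumed to be given. The over-subtraction factor `alpha`, the spectral `floor`, and the rule of choosing the channel with the lower estimated noise energy are assumptions; the abstract does not state the selection criterion.

```python
def spectral_subtract(mag, noise_mag, alpha=1.0, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from one frame,
    flooring each bin at a small fraction of the observed magnitude."""
    return [max(m - alpha * n, floor * m) for m, n in zip(mag, noise_mag)]

def select_channel_per_frame(left, right, noise_left, noise_right):
    """For each frame, denoise and keep the channel whose estimated noise
    energy is lower (hypothetical criterion, chosen for illustration)."""
    output = []
    for fl, fr, nl, nr in zip(left, right, noise_left, noise_right):
        energy_left = sum(n * n for n in nl)
        energy_right = sum(n * n for n in nr)
        if energy_left <= energy_right:
            output.append(spectral_subtract(fl, nl))
        else:
            output.append(spectral_subtract(fr, nr))
    return output

# Two frames of two-bin magnitude spectra per channel (made-up numbers).
left = [[1.0, 2.0], [1.0, 1.0]]
right = [[1.0, 2.0], [1.0, 1.0]]
noise_left = [[0.5, 0.5], [0.9, 0.9]]
noise_right = [[0.8, 0.8], [0.1, 0.1]]
print(select_channel_per_frame(left, right, noise_left, noise_right))
```

The flooring term keeps bins from going negative when the noise estimate exceeds the observed magnitude, which is a common safeguard in spectral subtraction.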

Citations: 6

Convolutive Bottleneck Network features for LVCSR

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163903

Karel Veselý, M. Karafiát, F. Grézl

In this paper, we focus on improvements of the bottleneck ANN in a Tandem LVCSR system. First, the influence of training set size and ANN size is evaluated. Second, a very positive effect of a linear bottleneck is shown. Finally, a Convolutive Bottleneck Network is proposed as an extension of the current state-of-the-art Universal Context Network. The proposed training method leads to a 5.5% relative reduction of WER compared to the Universal Context ANN baseline. The relative improvement compared to the 5-layer single-bottleneck network is 17.7%. The dataset ctstrain07, composed of more than 2000 hours of English Conversational Telephone Speech, was used for the experiments. The TNet toolkit with CUDA GPGPU implementation was used for fast training.

Citations: 108

Automatic detection of “g-dropping” in American English using forced alignment

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163980

Jiahong Yuan, M. Liberman

This study investigated the use of forced alignment for automatic detection of “g-dropping” in American English (e.g., walkin'). Two acoustic models were trained, one for -in' and the other for -ing. The models were added to the Penn Phonetics Lab Forced Aligner, and forced alignment then chose the more probable of the two pronunciations. The agreement rates between the forced alignment method and native English speakers ranged from 79% to 90%, comparable to the agreement rates among the native speakers (79% to 96%). The two pronunciation variants differed not only in their nasal codas but also, and even more so, in their vowel quality. This is shown both by the KL-divergence between the two models and by the poor performance of native Mandarin speakers on classifying “g-dropping”.
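To make the KL-divergence comparison concrete, here is a minimal sketch of the closed-form KL divergence between two diagonal-covariance Gaussians. The acoustic models in the study are presumably richer than a single Gaussian, so this is only an illustration of the measure itself; the function name and the toy parameters are hypothetical.

```python
import math

def kl_diag_gaussians(mean0, var0, mean1, var1):
    """KL(N0 || N1) for Gaussians with diagonal covariance:
    0.5 * sum(log(v1/v0) + (v0 + (m0 - m1)^2) / v1 - 1)."""
    total = 0.0
    for m0, v0, m1, v1 in zip(mean0, var0, mean1, var1):
        total += math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0
    return 0.5 * total

# Toy one-dimensional "models" for two pronunciation variants.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))
```

A larger divergence for the vowel states than for the nasal states would be the kind of evidence the abstract describes, although the actual per-state values are not given there.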

Citations: 30

Sentiment analysis of text-to-speech input using latent affective mapping

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163948

J. Bellegarda

To impart a congruent emotional quality to synthetic speech, it is expedient to leverage the overall polarity of the input text. This is feasible inasmuch as speech generation complies with the outcome of sentiment analysis. We have recently introduced latent affective mapping [1]–[3], a new approach to emotion detection which exploits two separate levels of semantic information: one that encapsulates the foundations of the domain considered, and one that specifically accounts for the overall affective fabric of the language. The ensuing framework exposes the emergent relationship between these two levels in order to advantageously inform affective evaluation. This paper applies latent affective mapping to the narrower problem of sentiment analysis, in order to achieve a more robust identification of the polarity of textual data. Empirical evidence gathered on the “Affective Text” portion of the SemEval-2007 corpus [4] shows that this approach is promising for automatic sentiment prediction in text. This bodes well as a first step in ensuring emotional congruence in text-to-speech synthesis.

Citations: 1

Topic modeling for spoken documents using only phonetic information

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163964

Timothy J. Hazen, M. Siu, H. Gish, S. Lowe, Arthur Chan

This paper explores both supervised and unsupervised topic modeling for spoken audio documents using only phonetic information. In cases where word-based recognition is unavailable or infeasible, phonetic information can be used to indirectly learn and capture information provided by topically relevant lexical items. In some situations, a lack of transcribed data can prevent supervised training of a same-language phonetic recognition system. In these cases, phonetic recognition can use cross-language models or self-organizing units (SOUs) learned in a completely unsupervised fashion. This paper presents recent improvements in topic modeling using only phonetic information. We present new results using recently developed techniques for discriminative training for topic identification used in conjunction with recent improvements in SOU learning. A preliminary examination of the use of unsupervised latent topic modeling for unsupervised discovery of topics and topically relevant lexical items from phonetic information is also presented.

Citations: 19

Automatic detection of unnatural word-level segments in unit-selection speech synthesis

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163946

William Yang Wang, Kallirroi Georgila

We investigate the problem of automatically detecting unnatural word-level segments in unit selection speech synthesis. We use a large set of features, namely, target and join costs, language models, prosodic cues, energy and spectrum, and Delta Term Frequency Inverse Document Frequency (TF-IDF), and we report comparative results between different feature types and their combinations. We also compare three modeling methods based on Support Vector Machines (SVMs), Random Forests, and Conditional Random Fields (CRFs). We then discuss our results and present a comprehensive error analysis.

Citations: 15

A convex hull approach to sparse representations for exemplar-based speech recognition

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163906

Tara N. Sainath, D. Nahamoo, D. Kanevsky, B. Ramabhadran, P. Shah

In this paper, we propose a novel exemplar-based technique for classification problems where, for every new test sample, the classification model is re-estimated from a subset of relevant samples of the training data. We formulate the exemplar-based classification paradigm as a sparse representation (SR) problem, and explore the use of convex hull constraints to enforce both regularization and sparsity. Finally, we utilize the Extended Baum-Welch (EBW) optimization technique to solve the SR problem. We explore our proposed methodology on the TIMIT phonetic classification task, showing that our proposed method offers statistically significant improvements over common classification methods, and provides an accuracy of 82.9%, the best single-classifier number reported to date.
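The convex hull constraint mentioned in this abstract can be read as requiring the exemplar weights to be non-negative and to sum to one. The sketch below fits such weights to a test vector by least squares. It uses projected gradient descent rather than the Extended Baum-Welch technique named in the abstract, and the function names, step size, and iteration count are assumptions made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last valid index
    return [max(x - theta, 0.0) for x in v]

def convex_hull_weights(exemplars, y, steps=500, lr=0.1):
    """Minimise ||y - sum_k b_k * exemplar_k||^2 over the simplex.
    Assumption: projected gradient descent stands in for EBW."""
    n, dim = len(exemplars), len(y)
    b = [1.0 / n] * n
    for _ in range(steps):
        recon = [sum(b[k] * exemplars[k][d] for k in range(n)) for d in range(dim)]
        resid = [r - t for r, t in zip(recon, y)]
        grad = [2.0 * sum(e[d] * resid[d] for d in range(dim)) for e in exemplars]
        b = project_to_simplex([bk - lr * g for bk, g in zip(b, grad)])
    return b

# Toy exemplars: the test point lies midway between the last two.
exemplars = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = convex_hull_weights(exemplars, [0.5, 0.5])
print([round(w, 3) for w in weights])
```

In this toy case the weight on the first exemplar is driven to zero, which shows how the simplex constraint can yield sparse solutions without a separate sparsity penalty.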

Citations: 12

A novel neural-based pronunciation modeling method for robust speech recognition

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163985

Guangpu Huang, M. Er

This paper describes a recurrent neural network (RNN) based articulatory-phonetic inversion (API) model for improved speech recognition. A specialized optimization algorithm is introduced to enable human-like heuristic learning in an efficient data-driven manner to capture the dynamic nature of English speech pronunciations. The API model demonstrates superior pronunciation modeling ability and robustness against noise contamination in large-vocabulary speech recognition experiments. Using a simple rescoring formula, it improves the hidden Markov model (HMM) baseline speech recognizer, with consistent error rate reductions of 5.30% and 10.14% for phoneme recognition tasks on clean and noisy speech, respectively, on the selected TIMIT datasets. An error rate reduction of 3.35% is obtained for the SCRIBE-TIMIT word recognition tasks. The proposed system qualifies as a competitive candidate for profound pronunciation modeling with intrinsic salient features such as generality and portability.

Citations: 3

Frame-level AnyBoost for LVCSR with the MMI Criterion

2011 IEEE Workshop on Automatic Speech Recognition & Understanding

Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163897

Ryuki Tachibana, Takashi Fukuda, U. Chaudhari, B. Ramabhadran, P. Zhan

This paper proposes a variant of AnyBoost for a large vocabulary continuous speech recognition (LVCSR) task. AnyBoost is an efficient algorithm for training an ensemble of weak learners by gradient descent on an objective function. We present a novel training procedure that trains acoustic models via the MMI criterion using data weighted in proportion to the summation of the posterior functions of the previous round of weak learners. Optimized for system combination by n-best ROVER at runtime, data weights for a new weak learner are computed as a weighted summation of posteriors of previous weak learners. We compare a frame-based version and a sentence-based version of our proposed algorithm with a frame-based AdaBoost algorithm. We present results on a voice search task trained with different amounts of data, showing that relative WER gains of 5.1% to 7.5% can be obtained with three rounds of boosting.

Citations: 1
