2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Papers

Joint training of interpolated exponential n-gram models

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707700

A. Sethy, Stanley F. Chen, E. Arisoy, B. Ramabhadran, Kartik Audhkhasi, Shrikanth S. Narayanan, Paul Vozila

For many speech recognition tasks, the best language model performance is achieved by collecting text from multiple sources or domains, and interpolating language models built separately on each individual corpus. When multiple corpora are available, it has also been shown that when using a domain adaptation technique such as feature augmentation [1], the performance on each individual domain can be improved by training a joint model across all of the corpora. In this paper, we explore whether improving each domain model via joint training also improves performance when interpolating the models together. We show that the diversity of the individual models is an important consideration, and propose a method for adjusting diversity to optimize overall performance. We present results using word n-gram models and Model M, a class-based n-gram model, and demonstrate improvements in both perplexity and word-error rate relative to state-of-the-art results on a Broadcast News transcription task.
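The interpolation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain linear interpolation with fixed weights; the probabilities, the weights, and the function names `interpolate` and `perplexity` are hypothetical and are not taken from the paper.

```python
import math

def interpolate(probs, weights):
    """Linearly interpolate the probabilities several models assign to one word."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))

def perplexity(per_word_probs):
    """Perplexity is the exponential of the negative mean log-probability."""
    log_sum = sum(math.log(p) for p in per_word_probs)
    return math.exp(-log_sum / len(per_word_probs))

# Hypothetical probabilities that two domain models assign to three held-out words.
model_a = [0.20, 0.05, 0.10]
model_b = [0.02, 0.30, 0.10]

# Equal weights are an assumption; in practice they are tuned on held-out data.
mixed = [interpolate([a, b], [0.5, 0.5]) for a, b in zip(model_a, model_b)]

print(perplexity(model_a), perplexity(model_b), perplexity(mixed))
```

In this toy case the two models make different errors, so the mixture has lower perplexity than either model alone, which is one way to read the abstract's point about diversity.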

Citations: 4

The TAO of ATWV: Probing the mysteries of keyword search performance

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707728

S. Wegmann, Arlo Faria, Adam L. Janin, K. Riedhammer, N. Morgan

In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search systems. Our analysis uses two new oracle ATWV measures, a bootstrap-based ATWV confidence interval, and includes a study of the underpinnings of the large ATWV gains due to system combination. This analysis quantifies the potential ATWV gains from improving the number of true hits and the overall quality of the detection scores in our system's posting lists. It also shows that system combination improves our systems' ATWV via a small increase in the number of true hits in the posting lists.
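For readers unfamiliar with the metric, the following is a minimal sketch of term weighted value as commonly defined for the NIST keyword search evaluations: one minus the average over keywords of the miss probability plus a weighted false-alarm probability. The abstract does not state the formula, so the weight `BETA = 999.9`, the per-keyword count layout, and the function name `atwv` are assumptions of this sketch.

```python
BETA = 999.9  # false-alarm weight; assumed value, not given in the abstract

def atwv(counts, speech_seconds, beta=BETA):
    """Compute term weighted value at the system's actual decisions.

    counts maps each keyword to (n_true, n_correct, n_false_alarm).
    speech_seconds is the duration of the searched audio in seconds.
    """
    total = 0.0
    n_keywords = 0
    for n_true, n_correct, n_fa in counts.values():
        if n_true == 0:
            continue  # keywords absent from the reference are skipped
        p_miss = 1.0 - n_correct / n_true
        # The number of non-target trials is approximated as one per second.
        p_fa = n_fa / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
        n_keywords += 1
    if n_keywords == 0:
        raise ValueError("no keyword occurs in the reference")
    return 1.0 - total / n_keywords

# Hypothetical counts for two keywords in ten hours of audio.
example = {"hello": (10, 8, 1), "market": (4, 4, 0)}
print(atwv(example, speech_seconds=36000))
```

Under this definition a system that finds every occurrence with no false alarms scores 1, and a system that returns nothing scores 0. Because the false-alarm weight is large, a few false alarms can cost as much as many misses.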

Citations: 47

Hierarchical neural networks and enhanced class posteriors for social signal classification

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707757

Raymond Brueckner, Björn Schuller

With the impressive advances of deep learning in recent years the interest in neural networks has resurged in the fields of automatic speech recognition and emotion recognition. In this paper we apply neural networks to address speaker-independent detection and classification of laughter and filler vocalizations in speech. We first explore modeling class posteriors with standard neural networks and deep stacked autoencoders. Then, we adopt a hierarchical neural architecture to compute enhanced class posteriors and demonstrate that this approach introduces significant and consistent improvements on the Social Signals Sub-Challenge of the Interspeech 2013 Computational Paralinguistics Challenge (ComParE). On this task we achieve a value of 92.4% of the unweighted average area-under-the-curve, which is the official competition measure, on the test set. This constitutes an improvement of 9.1% over the baseline and is the best result obtained so far on this task.

Citations: 15

Deep maxout networks for low-resource speech recognition

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707763

Yajie Miao, Florian Metze, Shourabh Rawat

As a feed-forward architecture, the recently proposed maxout networks integrate dropout naturally and show state-of-the-art results on various computer vision datasets. This paper investigates the application of deep maxout networks (DMNs) to large vocabulary continuous speech recognition (LVCSR) tasks. Our focus is on the particular advantage of DMNs under low-resource conditions with limited transcribed speech. We extend DMNs to hybrid and bottleneck feature systems, and explore optimal network structures (number of maxout layers, pooling strategy, etc) for both setups. On the newly released Babel corpus, behaviors of DMNs are extensively studied under different levels of data availability. Experiments show that DMNs improve low-resource speech recognition significantly. Moreover, DMNs introduce sparsity to their hidden activations and thus can act as sparse feature extractors.
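As a rough illustration of what a maxout unit computes, the sketch below takes the maximum over several affine "pieces" of the input. It is a generic, pure-Python sketch of the maxout idea, not the network configuration used in the paper; the function name `maxout_layer` and the toy weights are hypothetical.

```python
def maxout_layer(x, weights, biases):
    """Apply a maxout layer to input vector x.

    weights[j][k] is the weight vector of piece k in unit j,
    and biases[j][k] is the matching bias. Each unit outputs
    the largest of its affine pieces.
    """
    outputs = []
    for unit_weights, unit_biases in zip(weights, biases):
        pieces = [
            sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(unit_weights, unit_biases)
        ]
        outputs.append(max(pieces))
    return outputs

# One unit with two pieces, x and -x, which together compute |x|.
abs_weights = [[[1.0], [-1.0]]]
abs_biases = [[0.0, 0.0]]
print(maxout_layer([-3.0], abs_weights, abs_biases))
```

The toy unit shows that a maxout unit with two pieces can represent the absolute value function, so the activation shape is learned rather than fixed in advance.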

Citations: 99

Discriminative piecewise linear transformation based on deep learning for noise robust automatic speech recognition

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707755

Yosuke Kashiwagi, D. Saito, N. Minematsu, K. Hirose

In this paper, we propose the use of deep neural networks to extend conventional methods of statistical feature enhancement based on piecewise linear transformation. Stereo-based piecewise linear compensation for environments (SPLICE), a powerful statistical approach to feature enhancement, models the probability distribution of noisy input features as a mixture of Gaussians. However, the soft assignment of an input vector to the divided regions is sometimes inadequate, and the vector then undergoes an inadequate conversion. Performance degrades easily, especially when the conversion has to be linear. Feature enhancement using neural networks is another powerful approach, which can directly model a non-linear relationship between the noisy and clean feature spaces. This approach, however, tends to suffer from over-fitting. In this paper, we attempt to mitigate this problem by reducing the number of model parameters to be estimated. We train a neural network whose output layer is associated with states in the clean feature space rather than in the noisy feature space. This strategy makes the size of the output layer independent of the type of noisy environment. We first characterize the distribution of clean features as a Gaussian mixture model and then use deep neural networks to discriminatively estimate the state in the clean space to which a noisy input feature corresponds. Experimental evaluations using the Aurora 2 dataset demonstrate that our proposed method has the best performance compared to conventional methods.

Citations: 9

A hierarchical system for word discovery exploiting DTW-based initialization

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707761

Oliver Walter, Timo Korthals, Reinhold Häb-Umbach, B. Raj

Discovering the linguistic structure of a language solely from spoken input asks for two steps: phonetic and lexical discovery. The first is concerned with identifying the categorical subword unit inventory and relating it to the underlying acoustics, while the second aims at discovering words as repeated patterns of subword units. The hierarchical approach presented here accounts for classification errors in the first stage by modelling the pronunciation of a word in terms of subword units probabilistically: a hidden Markov model with discrete emission probabilities, emitting the observed subword unit sequences. We describe how the system can be learned in a completely unsupervised fashion from spoken input. To improve the initialization of the training of the word pronunciations, the output of a dynamic time warping based acoustic pattern discovery system is used, as it is able to discover similar temporal sequences in the input data. This improved initialization, using only weak supervision, has led to a 40% reduction in word error rate on a digit recognition task.
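Dynamic time warping, which the abstract uses to initialize training, can be sketched in its textbook form as follows. This is a minimal global alignment of two scalar sequences with absolute difference as the local cost; the paper's pattern discovery system works on acoustic features and is more elaborate, so the function name `dtw_distance` and the toy sequences are purely illustrative.

```python
def dtw_distance(a, b):
    """Return the minimum cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed steps: advance in a, advance in b, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

# The second sequence repeats one value, which warping absorbs at no cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because warping lets one frame align with several frames of the other sequence, two realizations of the same pattern at different speaking rates can still receive a low distance.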

Citations: 38

Neighbour selection and adaptation for rapid speaker-dependent ASR

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707706

Udhyakumar Nallasamy, Mark C. Fuhs, M. Woszczyna, Florian Metze, Tanja Schultz

Speaker dependent (SD) ASR systems have significantly lower word error rates (WER) compared to speaker independent (SI) systems. However, SD systems require sufficient training data from the target speaker, which is impractical to collect in a short time. We present a technique for training SD models using just a few minutes of the speaker's data. We compensate for the lack of adequate speaker-specific data by selecting neighbours from a database of existing speakers who are acoustically close to the target speaker. These neighbours provide ample training data, which is used to adapt the SI model to obtain an initial SD model for the new speaker with significantly lower WER. We evaluate various neighbour selection algorithms on a large-scale medical transcription task and report a significant reduction in WER using only 5 minutes of speaker-specific data. We conduct a detailed analysis of various factors such as gender and accent in the neighbour selection. Finally, we study neighbour selection and adaptation in the context of discriminative objective functions.

Citations: 3

Phonetic and anthropometric conditioning of MSA-KST cognitive impairment characterization system

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707734

A. Ivanov, S. Jalalvand, R. Gretter, D. Falavigna

We explore the impact of speech- and speaker-specific modeling on the Modulation Spectrum Analysis - Kolmogorov-Smirnov feature Testing (MSA-KST) characterization method in the task of automated prediction of a cognitive impairment diagnosis, namely dysphasia and pervasive developmental disorder. Phoneme-synchronous capturing of speech dynamics is a reasonable choice for a segmental speech characterization system as it allows speech dynamics to be compared in similar phonetic contexts. Speaker-specific modeling aims at reducing the “within-the-class” variability of the characterized speech or speaker population by removing the effect of speaker properties that should have no relation to the characterization. Specifically, the vocal tract length of a speaker has nothing to do with the diagnosis and, thus, the feature set should be normalized accordingly. The resulting system compares favorably to the baseline system of the Interspeech'2013 Computational Paralinguistics Challenge.
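The abstract does not say how the Kolmogorov-Smirnov test is applied to the features, so the sketch below only shows the standard two-sample KS statistic, the largest gap between two empirical distribution functions. The function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Return the two-sample Kolmogorov-Smirnov statistic for samples xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    # The gap between the two step functions can only change at observed values.
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )

# Hypothetical values of one feature measured in two speaker groups.
group_a = [0.1, 0.4, 0.5, 0.9]
group_b = [0.6, 0.7, 1.1, 1.3]
print(ks_statistic(group_a, group_b))
```

A statistic near 1 means the feature separates the two groups almost completely, while a value near 0 means the two samples are distributed alike, which is one plausible basis for ranking features.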

Citations: 3

Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707732

Duc Le, E. Provost

Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.

Citations: 102

An SVD-based scheme for MFCC compression in distributed speech recognition system

2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707738

A. Touazi, M. Debyeche

This paper proposes a new scheme for low bit-rate source coding of Mel Frequency Cepstral Coefficients (MFCCs) in a Distributed Speech Recognition (DSR) system. The method uses the compressed ETSI Advanced Front-End (ETSI-AFE) features factorized into SVD components. By exploiting the correlation between successive MFCC frames, the odd frames are encoded using ETSI-AFE, while for the even frames only the singular values and the index of the nearest left singular vectors are encoded and transmitted. At the server side, the non-transmitted MFCCs are reconstructed from their quantized singular values and the nearest left singular vectors. The system provides a compression bit-rate of 2.7 kbps. The recognition experiments were carried out on the Aurora-2 database for clean and multi-condition training modes. The simulation results show good recognition performance without significant degradation with respect to the ETSI-AFE encoder.

Citations: 1
