
2013 IEEE Workshop on Automatic Speech Recognition and Understanding: Latest Publications

Acoustic characteristics related to the perceptual pitch in whispered vowels
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707737
H. Konno, Hideo Kanemitsu, N. Takahashi, Mineichi Kudo
The characteristics of whispered speech are not well known. The most remarkable difference from ordinary speech is pitch (the perceived height of speech), since whispered speech has no fundamental frequency. In this study, we tried to reveal the mechanism of pitch production in whispered speech through an experiment in which one male and one female subject uttered Japanese whispered vowels while tuning their pitch to a guidance tone presented at five to nine different frequencies. We applied multivariate analysis, such as principal component analysis, to the data in order to clarify which part of the spectrum contributes most to the change of pitch. We succeeded in confirming previous observations, i.e. that a shift of formants is dominant, with more detailed numerical evidence. In addition, we obtained some implications for approaching the pitch mechanism of whispered speech. The main result is that two or three formants below 5 kHz are shifted upward and the energy is increased in the high-frequency region above 5 kHz.
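As an illustration of the analysis step (a sketch with placeholder data, not the authors' code), principal component analysis can expose which frequency bands covary with the intended pitch: components whose scores correlate with the guidance-tone frequency have loadings concentrated on the bands that move.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical input: rows are whispered-vowel utterances, columns are
# log band energies (band count and data here are placeholders).
rng = np.random.default_rng(0)
n_utt, n_bands = 40, 32
spectra = rng.normal(size=(n_utt, n_bands))
target_pitch = rng.uniform(200.0, 600.0, size=n_utt)  # guidance-tone Hz

pca = PCA(n_components=3)
scores = pca.fit_transform(spectra)  # per-utterance component scores

# A component whose score tracks the intended pitch points, via its
# loadings, at the frequency bands that change with perceived pitch.
for k in range(3):
    r = np.corrcoef(scores[:, k], target_pitch)[0, 1]
    top_bands = np.argsort(np.abs(pca.components_[k]))[::-1][:3]
    print(f"PC{k + 1}: r = {r:+.2f}, most-loaded bands: {top_bands}")
```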
Citations: 4
Learning state labels for sparse classification of speech with matrix deconvolution
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707724
Antti Hurmalainen, T. Virtanen
Non-negative spectral factorisation with long temporal context has been successfully used for noise-robust recognition of speech in multi-source environments. Sparse classification from activations of speech atoms can be employed instead of conventional GMMs to determine speech state likelihoods. For accurate classification, correct linguistic state labels must be assigned to the speech atoms. We propose using non-negative matrix deconvolution for learning the labels, with algorithms closely matching a framework that separates speech from additive noise. Experiments on the 1st CHiME Challenge corpus show improved recognition accuracy over labels acquired from the original atom sources or from previously used least squares regression. The new approach also circumvents numerical issues encountered in previous learning methods, and opens up possibilities for new speech basis generation algorithms.
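For intuition, here is a heavily simplified, non-convolutive NMF stand-in for the activation step (the paper uses non-negative matrix deconvolution with temporal context; X, W and the label matrix below are hypothetical):

```python
import numpy as np

def nmf_activations(X, W, n_iter=200, eps=1e-10):
    """Estimate non-negative activations H with X ≈ W @ H.

    X: (n_freq, n_frames) magnitude spectrogram of a noisy utterance.
    W: (n_freq, n_atoms) fixed dictionary of speech and noise atoms.
    Multiplicative updates minimise KL divergence; a simplified,
    non-convolutive stand-in for matrix deconvolution.
    """
    H = np.full((W.shape[1], X.shape[1]), 0.5)
    for _ in range(n_iter):
        V = W @ H + eps
        H *= (W.T @ (X / V)) / (W.sum(axis=0)[:, None] + eps)
    return H

# Sparse classification: if A (n_states, n_atoms) holds each atom's
# learned state label distribution, per-frame state evidence is A @ H.
```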
Citations: 6
Improved cepstral mean and variance normalization using Bayesian framework
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707722
N. Prasad, S. Umesh
Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances, due to insufficient data for parameter estimation and loss of discriminable information, as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN, instead of the maximum likelihood estimates. This Bayesian approach, in addition to providing a robust estimate of the parameters, is also shown to preserve discriminable information without an increase in computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach w.r.t. Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database.
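The posterior estimates act as shrinkage of the per-utterance statistics towards priors taken from training data, so short utterances lean on the prior while long ones approach plain CMVN. A minimal sketch of this idea (the pseudo-count prior strengths are illustrative assumptions, and the full posterior variance has an extra mean-mismatch term omitted here):

```python
import numpy as np

def bayesian_cmvn(feats, mu0, var0, kappa=16.0, nu=16.0):
    """Shrinkage-style CMVN: blend utterance statistics with priors.

    feats: (n_frames, n_dims) cepstral features of one utterance.
    mu0, var0: prior mean/variance per dimension from training data.
    kappa, nu: prior strengths in pseudo-frames (assumed values).
    """
    n = feats.shape[0]
    m = feats.mean(axis=0)          # ML mean of this utterance
    v = feats.var(axis=0)           # ML variance of this utterance
    mu_post = (kappa * mu0 + n * m) / (kappa + n)
    var_post = (nu * var0 + n * v) / (nu + n)
    return (feats - mu_post) / np.sqrt(var_post + 1e-10)
```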
Citations: 51
ASR for electro-laryngeal speech
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707735
A. Fuchs, J. A. Morales-Cordovilla, Martin Hagmüller
The electro-larynx device (EL) offers the possibility of re-obtaining speech when the larynx is removed after a total laryngectomy. Speech produced with an EL suffers from inadequate sound quality, so there is a strong need to enhance EL speech. When disordered speech is fed to Automatic Speech Recognition (ASR) systems, performance decreases significantly. ASR systems are increasingly part of daily life; therefore, the word accuracy rate for disordered speech should be reasonably high in order to make ASR technologies accessible to patients suffering from speech disorders. Moreover, ASR provides an objective rating of the intelligibility of disordered speech. In this paper we apply disordered speech, namely speech produced by an EL, to an ASR system designed for normal, healthy speech and evaluate its performance with different types of adaptation. Furthermore, we show that two approaches to reducing the directly radiated EL (DREL) noise from the device itself are able to increase the word accuracy rate compared to the unprocessed EL speech.
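Since word accuracy doubles as the paper's objective intelligibility rating, the underlying metric is the standard word error rate; a minimal Levenshtein-based computation (generic, not from the paper):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edits to turn the first i reference words into the first j
    # hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the larynx is removed", "a larynx removed"))  # 0.5
```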
Citations: 5
Automatic model complexity control for generalized variable parameter HMMs
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707721
Rongfeng Su, Xunying Liu, Lan Wang
An important task for speech recognition systems is to handle the mismatch against a target environment introduced by acoustic factors such as variable ambient noise. To address this issue, it is possible to explicitly approximate the continuous trajectory of optimal, well-matched model parameters against the varying noise using, for example, generalized variable parameter HMMs (GVP-HMMs). In order to improve the generalization and computational efficiency of conventional GVP-HMMs, this paper investigates a novel model complexity control method for GVP-HMMs. The optimal polynomial degrees of Gaussian mean, variance and model-space linear transform trajectories are automatically determined at the local level. Significant relative error rate reductions of 20% and 28% were obtained over the multi-style training baseline systems on Aurora 2 and a medium-vocabulary Mandarin Chinese speech recognition task, respectively. Consistent performance improvements and a relative model size compression of 57% were also obtained over the baseline GVP-HMM systems using a uniformly assigned polynomial degree.
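As a toy stand-in for the complexity control (not the authors' implementation), one can fit polynomial trajectories of increasing degree to a single Gaussian mean across noise conditions and keep the degree minimising a BIC-style criterion:

```python
import numpy as np

def select_trajectory_degree(snr, mean_vals, max_degree=4):
    """Pick the polynomial degree for one parameter trajectory by BIC.

    snr: (n,) noise conditions observed in training.
    mean_vals: (n,) ML estimates of a Gaussian mean per condition.
    Returns the winning degree and its coefficients.
    """
    n = len(snr)
    best = None
    for deg in range(max_degree + 1):
        coeffs = np.polyfit(snr, mean_vals, deg)
        resid = mean_vals - np.polyval(coeffs, snr)
        rss = float(resid @ resid)
        bic = n * np.log(rss / n + 1e-12) + (deg + 1) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, deg, coeffs)
    return best[1], best[2]
```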
Citations: 7
Dialogue management for leading the conversation in persuasive dialogue systems
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707715
Takuya Hiraoka, Yuki Yamauchi, Graham Neubig, S. Sakti, T. Toda, Satoshi Nakamura
In this research, we propose a probabilistic dialogue modeling method for persuasive dialogue systems that interact with the user based on a specific goal and lead the user to take the actions the system intends, chosen from candidate actions satisfying the user's needs. As a baseline, we develop a dialogue model that assumes the user makes decisions based on preference. We then improve the model by introducing methods to guide the user from topic to topic. We evaluate the system knowledge and the dialogue manager in a task that tests the system's persuasive power, and find that the proposed method is effective in this respect.
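A toy rendering of the preference-based baseline (an illustrative assumption about the model form, not the authors' exact formulation): the user picks among need-satisfying candidate actions with probability rising in preference, and the system leads towards the topic whose action it intends.

```python
import numpy as np

def user_choice_probs(preferences):
    """Softmax over preference scores: a toy preference-based user model."""
    z = np.exp(preferences - np.max(preferences))
    return z / z.sum()

prefs = np.array([1.2, 0.4, 0.9])     # hypothetical user preferences
intended = np.array([0.0, 1.0, 1.0])  # 1 = action the system intends
p = user_choice_probs(prefs)
# Lead towards the topic that is both likely for the user and intended.
best_topic = int(np.argmax(p * intended))
print(p, best_topic)
```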
Citations: 11
Speaker adaptation of neural network acoustic models using i-vectors
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707705
G. Saon, H. Soltau, D. Nahamoo, M. Picheny
We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network, in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a 300-hour Switchboard corpus show that DNNs trained on speaker-independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker-independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR), with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after Hessian-free sequence training over networks trained on speaker-adapted features only.
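The input construction itself is simple concatenation; a minimal sketch (the 40-dim acoustic and 100-dim i-vector sizes are assumed for illustration):

```python
import numpy as np

def append_ivector(feats, ivector):
    """Concatenate a speaker's i-vector to every acoustic frame.

    feats: (n_frames, n_acoustic) regular ASR features.
    ivector: (n_ivec,) identity vector of the utterance's speaker.
    Returns (n_frames, n_acoustic + n_ivec) network input: the same
    i-vector repeated on each frame of that speaker.
    """
    tiled = np.tile(ivector, (feats.shape[0], 1))
    return np.hstack([feats, tiled])

x = append_ivector(np.zeros((300, 40)), np.ones(100))
print(x.shape)  # (300, 140)
```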
Citations: 650
Deep maxout neural networks for speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707745
Meng Cai, Yongzhe Shi, Jia Liu
A recently introduced type of neural network called maxout has worked well in many domains. In this paper, we propose to apply maxout to acoustic models for speech recognition. The maxout neuron picks the maximum value within a group of linear pieces as its activation. This nonlinearity is a generalization of the rectified nonlinearity and has the ability to approximate any form of activation function. We apply maxout networks to the Switchboard phone-call transcription task and evaluate the performance under both a 24-hour low-resource condition and a 300-hour core condition. Experimental results demonstrate that maxout networks converge faster, generalize better and are easier to optimize than rectified linear networks and sigmoid networks. Furthermore, experiments show that maxout networks reduce underfitting and are able to achieve good results without dropout training. Under both conditions, maxout networks yield relative improvements of 1.1-5.1% over rectified linear networks and 2.6-14.5% over sigmoid networks on benchmark test sets.
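The activation itself is a max over grouped linear pieces; a minimal NumPy sketch (group size chosen for illustration):

```python
import numpy as np

def maxout(z, group_size):
    """Maxout: keep the maximum of each group of linear pieces.

    z: (batch, n_units * group_size) pre-activations of one layer.
    Returns (batch, n_units). With one piece per group pinned at zero
    this reduces to the rectifier, hence maxout generalises ReLU.
    """
    b, d = z.shape
    return z.reshape(b, d // group_size, group_size).max(axis=2)

z = np.array([[0.3, -1.0, 2.0, 0.5]])
print(maxout(z, 2))  # [[0.3 2. ]]
```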
Citations: 77
Automatic pronunciation clustering using a World English archive and pronunciation structure analysis
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707733
Han-Ping Shen, N. Minematsu, T. Makino, S. Weinberger, T. Pongkittiphan, Chung-Hsien Wu
English is the only language available for global communication. Due to the influence of speakers' mother tongues, however, those from different regions inevitably have different accents in their pronunciation of English. The ultimate goal of our project is to create a global pronunciation map of World Englishes on an individual basis, which speakers can use to locate similar English pronunciations. If the speaker is a learner, he can also learn how his pronunciation compares to other varieties. Creating the map mathematically requires a matrix of pronunciation distances among all the speakers considered. This paper investigates invariant pronunciation structure analysis and Support Vector Regression (SVR) to predict inter-speaker pronunciation distances. In the experiments, the Speech Accent Archive (SAA), which contains speech data of worldwide accented English, is used for training and testing samples. IPA narrow transcriptions in the archive are used to prepare reference pronunciation distances, which are then predicted using structural analysis and SVR, without recourse to the IPA transcriptions. The correlation between the reference distances and the predicted distances is calculated. The results are very promising, and our proposed method outperforms by far a baseline system developed using an HMM-based phoneme recognizer.
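A hedged sketch of the regression step with scikit-learn's SVR; the pair features and reference distances below are random placeholders standing in for the structure-derived features and IPA-based distances:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
pair_features = rng.normal(size=(500, 20))    # placeholder pair features
ref_distance = rng.uniform(0, 1, size=500)    # placeholder IPA distances

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(pair_features[:400], ref_distance[:400])
pred = model.predict(pair_features[400:])

# Evaluated, as in the paper, by correlation with the reference distances.
r = np.corrcoef(pred, ref_distance[400:])[0, 1]
print(f"correlation: {r:.2f}")
```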
Citations: 6
Hybrid acoustic models for distant and multichannel large vocabulary speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707744
P. Swietojanski, Arnab Ghoshal, S. Renals
We investigate the application of deep neural network (DNN)-hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays. We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs). We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4-6% absolute WER reduction when using beamforming on various combinations of array channels. By training the networks on audio from multiple channels, we find the networks can recover a significant part of the accuracy difference between the single-distant-microphone and beamformed configurations. Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.
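As a rough illustration of the beamforming front-end (a generic delay-and-sum sketch under assumed steering delays, not the paper's exact pipeline):

```python
import numpy as np

def delay_and_sum(channels, delays, sr=16000):
    """Delay-and-sum beamforming over microphone-array channels.

    channels: (n_mics, n_samples) microphone recordings.
    delays: per-mic steering delays in seconds, assumed known (e.g. via
    cross-correlation against a reference mic).
    Sample wrap-around from np.roll is ignored for simplicity.
    """
    n_mics, _ = channels.shape
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(round(d * sr)))  # advance delayed channel
    return out / n_mics

y = delay_and_sum(np.random.randn(8, 16000), np.zeros(8))
print(y.shape)  # (16000,)
```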
Citations: 112