Voice Activity Detection in Presence of Transient Noise Using Spectral Clustering
S. Mousazadeh, I. Cohen
Pub Date: 2013-06-01 | DOI: 10.1109/TASL.2013.2248717
Voice activity detection (VAD) has attracted significant research effort over the last two decades. Despite much progress in designing voice activity detectors, VAD in the presence of transient noise remains a challenging problem. In this paper, we develop a novel VAD algorithm based on spectral clustering methods. The proposed technique is a supervised learning algorithm that divides the input signal into two clusters: speech-present and speech-absent frames. We use labeled data to adjust the parameters of the kernel used in spectral clustering for computing the similarity matrix. The parameters obtained in the training stage, together with the eigenvectors of the normalized Laplacian of the similarity matrix and a Gaussian mixture model (GMM), are used to compute the likelihood ratio needed for voice activity detection. Simulation results demonstrate the advantage of the proposed method over conventional statistical model-based VAD algorithms in the presence of transient noise.
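To make the pipeline concrete, here is a minimal sketch (not the authors' implementation) of the core steps: a Gaussian-kernel similarity matrix over frame features, the normalized-Laplacian spectral embedding, and a two-component GMM whose log-likelihood ratio serves as the VAD statistic. The kernel width `sigma` stands in for the kernel parameters the paper learns from labeled data, and feature extraction is assumed to happen upstream.

```python
# Sketch: spectral-clustering-based VAD score, assuming precomputed frame features.
import numpy as np
from scipy.linalg import eigh
from sklearn.mixture import GaussianMixture

def vad_scores(features, sigma=1.0, n_eig=2):
    # Gaussian-kernel similarity matrix between frame feature vectors
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    deg = W.sum(1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    Dih = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(W)) - Dih @ W @ Dih
    # Embed frames using eigenvectors of the smallest eigenvalues
    _, vecs = eigh(L)
    emb = vecs[:, :n_eig]
    # Two-component GMM over the embedding; cluster order is arbitrary here
    # (in the paper, labeled data identifies which cluster is speech)
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(emb)
    post = gmm.predict_proba(emb)
    # Log-likelihood ratio between the two clusters as the VAD statistic
    return np.log(post[:, 1] + 1e-12) - np.log(post[:, 0] + 1e-12)
```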
{"title":"Voice Activity Detection in Presence of Transient Noise Using Spectral Clustering","authors":"S. Mousazadeh, I. Cohen","doi":"10.1109/TASL.2013.2248717","DOIUrl":"https://doi.org/10.1109/TASL.2013.2248717","url":null,"abstract":"Voice activity detection has attracted significant research efforts in the last two decades. Despite much progress in designing voice activity detectors, voice activity detection (VAD) in presence of transient noise is a challenging problem. In this paper, we develop a novel VAD algorithm based on spectral clustering methods. We propose a VAD technique which is a supervised learning algorithm. This algorithm divides the input signal into two separate clusters (i.e., speech presence and speech absence frames). We use labeled data in order to adjust the parameters of the kernel used in spectral clustering methods for computing the similarity matrix. The parameters obtained in the training stage together with the eigenvectors of the normalized Laplacian of the similarity matrix and Gaussian mixture model (GMM) are utilized to compute the likelihood ratio needed for voice activity detection. Simulation results demonstrate the advantage of the proposed method compared to conventional statistical model-based VAD algorithms in presence of transient noise.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2248717","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Compensation of Loudspeaker–Room Responses in a Robust MIMO Control Framework
Lars-Johan Brännmark, A. Bahne, A. Ahlén
Pub Date: 2013-06-01 | DOI: 10.1109/TASL.2013.2245650
A new multichannel approach to robust broadband loudspeaker-room equalization is presented. Traditionally, the equalization (or room correction) problem has been treated primarily by single-channel methods, in which loudspeaker input signals are prefiltered individually by separate scalar filters. Single-channel methods are generally able to improve the average spectral flatness of the acoustic transfer functions in a listening region, but they cannot reduce the variability of the transfer functions within the region. Most modern audio reproduction systems, however, contain two or more loudspeakers, and in this paper we aim to improve the equalization performance by using all available loudspeakers jointly. To this end, we propose a polynomial-based MIMO formulation of the equalization problem. The new approach, which generalizes an earlier single-channel approach by the authors, is found to reduce the average reproduction error and the transfer function variability over a region in space. Moreover, pre-ringing artifacts are avoided, and the reproduction error below 1000 Hz is reduced significantly, by an amount that scales with the number of loudspeakers used.
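As a toy illustration of joint multichannel design (an assumed setup, not the paper's robust polynomial MIMO method), the sketch below solves one least-squares problem for the FIR prefilters of M loudspeakers jointly, so that the summed loudspeaker-room responses approximate a pure delay at a single listening point. The paper's method additionally enforces robustness over a spatial region and avoids pre-ringing, which this toy version does not.

```python
# Sketch: joint least-squares FIR equalization of M loudspeaker-room responses.
import numpy as np
from scipy.linalg import toeplitz, lstsq

def joint_eq_filters(rirs, filt_len, delay):
    # rirs: list of M room impulse responses (1-D arrays) to one point
    M = len(rirs)
    n_out = max(len(h) for h in rirs) + filt_len - 1
    # Stack the convolution (Toeplitz) matrices of all channels side by side
    blocks = []
    for h in rirs:
        col = np.concatenate([h, np.zeros(n_out - len(h))])
        row = np.zeros(filt_len)
        row[0] = h[0]
        blocks.append(toeplitz(col, row))
    H = np.hstack(blocks)                 # shape (n_out, M * filt_len)
    d = np.zeros(n_out); d[delay] = 1.0   # target: delayed unit impulse
    g, *_ = lstsq(H, d)                   # joint least-squares solution
    return g.reshape(M, filt_len)         # one FIR prefilter per loudspeaker
```

With more loudspeakers, the stacked system has more columns, so the least-squares fit to the target improves, which mirrors the reported scaling of the low-frequency error with loudspeaker count.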
{"title":"Compensation of Loudspeaker–Room Responses in a Robust MIMO Control Framework","authors":"Lars-Johan Brännmark, A. Bahne, A. Ahlén","doi":"10.1109/TASL.2013.2245650","DOIUrl":"https://doi.org/10.1109/TASL.2013.2245650","url":null,"abstract":"A new multichannel approach to robust broadband loudspeaker-room equalization is presented. Traditionally, the equalization (or room correction) problem has been treated primarily by single-channel methods, where loudspeaker input signals are prefiltered individually by separate scalar filters. Single-channel methods are generally able to improve the average spectral flatness of the acoustic transfer functions in a listening region, but they cannot reduce the variability of the transfer functions within the region. Most modern audio reproduction systems, however, contain two or more loudspeakers, and in this paper we aim at improving the equalization performance by using all available loudspeakers jointly. To this end we propose a polynomial based MIMO formulation of the equalization problem. The new approach, which is a generalization of an earlier single-channel approach by the authors, is found to reduce the average reproduction error and the transfer function variability over a region in space. Moreover, pre-ringing artifacts are avoided, and the reproduction error below 1000 Hz is significantly reduced with an amount that scales with the number of loudspeakers used.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2245650","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62887740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Wavelet Maxima Dispersion for Breathy to Tense Voice Discrimination
John Kane, C. Gobl
Pub Date: 2013-06-01 | DOI: 10.1109/TASL.2013.2245653
This paper proposes a new parameter, the Maxima Dispersion Quotient (MDQ), for differentiating breathy from tense voice. Maxima derived following wavelet decomposition are often used for detecting edges in image processing, where the maxima cluster in the vicinity of the edge. Similarly, for tense voice, which typically displays sharp glottal closing characteristics, the maxima following wavelet analysis are concentrated in the vicinity of the glottal closure instant (GCI). By contrast, as the phonation type tends away from tense voice towards breathier phonation, the maxima become increasingly dispersed. The MDQ parameter is designed to quantify the extent of this dispersion and is shown to compare favorably with existing voice quality parameters, particularly for the analysis of continuous speech. Classification experiments also reveal a significant improvement in the detection of these voice qualities when MDQ is included as an input to the classifier. Finally, MDQ is shown to be robust to additive noise down to a signal-to-noise ratio of 10 dB.
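A loose sketch of the dispersion idea follows (the exact decomposition and normalization differ from the paper): octave bands stand in for the wavelet scales, the strongest peak of each band's rectified output is located within each glottal cycle, and the spread of those peak locations around the GCI is measured, normalized by the local pitch period. The band layout and the `f0` default are illustrative assumptions.

```python
# Sketch: an MDQ-like dispersion measure per glottal cycle.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def mdq_like(x, fs, gcis, n_bands=5, f0=200.0):
    # Octave bands approximating wavelet scales: (f0/2, f0), (f0, 2*f0), ...
    bands = [(f0 * 2**k / 2, f0 * 2**k) for k in range(n_bands)]
    filtered = []
    for lo, hi in bands:
        sos = butter(4, [lo, min(hi, fs / 2 - 1)], "bandpass",
                     fs=fs, output="sos")
        filtered.append(np.abs(sosfiltfilt(sos, x)))  # crude envelope
    scores = []
    for i, gci in enumerate(gcis[:-1]):
        T = gcis[i + 1] - gci  # local pitch period in samples
        # Peak location of each band's envelope within this glottal cycle
        locs = [gci + np.argmax(b[gci:gci + T]) for b in filtered]
        # Dispersion of peak locations about the GCI, in pitch periods
        scores.append(np.mean(np.abs(np.array(locs) - gci)) / T)
    return np.array(scores)  # small for tense voice, larger for breathy
```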
{"title":"Wavelet Maxima Dispersion for Breathy to Tense Voice Discrimination","authors":"John Kane, C. Gobl","doi":"10.1109/TASL.2013.2245653","DOIUrl":"https://doi.org/10.1109/TASL.2013.2245653","url":null,"abstract":"This paper proposes a new parameter, the Maxima Dispersion Quotient (MDQ), for differentiating breathy to tense voice. Maxima derived following wavelet decomposition are often used for detecting edges in image processing, where locations of these maxima organize in the vicinity of the edge location. Similarly for tense voice, which typically displays sharp glottal closing characteristics, maxima following wavelet analysis are organized in the vicinity of the glottal closure instant (GCI). Contrastingly, as the phonation type tends away from tense voice towards a breathier phonation it is observed that the maxima become increasingly dispersed. The MDQ parameter is designed to quantify the extent of this dispersion and is shown to compare favorably to existing voice quality parameters, particularly for the analysis of continuous speech. Also, classification experiments reveal a significant improvement in the detection of the voice qualities when MDQ is included as an input to the classifier. Finally, MDQ is shown to be robust to additive noise down to a Signal-to-Noise Ratio of 10 dB.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2245653","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Eigentriphones for Context-Dependent Acoustic Modeling
Tom Ko, B. Mak
Pub Date: 2013-06-01 | DOI: 10.1109/TASL.2013.2248722
Most automatic speech recognizers employ tied-state triphone hidden Markov models (HMMs), in which the corresponding triphone states of the same base phone are tied. State tying is commonly performed using a phonetic regression class tree, which makes robust context-dependent modeling possible by carefully balancing the amount of training data against the degree of tying. However, tying inevitably introduces quantization error: triphones tied to the same state are not distinguishable in that state. Recently, we proposed a new triphone modeling approach called eigentriphone modeling, in which all triphone models are, in general, distinct. The idea is to create an eigenbasis for each base phone (or phone state) so that all its triphones (or triphone states) are represented as distinct points in the space spanned by the basis. We have shown that triphone HMMs trained using model-based or state-based eigentriphones perform at least as well as conventional tied-state HMMs. In this paper, we further generalize the definition of eigentriphones over clusters of acoustic units. Our experiments on TIMIT phone recognition and the Wall Street Journal 5K-vocabulary continuous speech recognition task show that eigentriphones estimated from state clusters defined by the nodes of the same phonetic regression class tree used in state tying yield further performance gains.
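The eigenbasis construction can be pictured with plain PCA over triphone mean supervectors, as in the schematic sketch below (simplified: the paper estimates the eigentriphone coordinates within a penalized maximum-likelihood framework rather than by direct projection).

```python
# Sketch: an eigenbasis over the triphones of one base phone via PCA.
import numpy as np

def eigentriphone_basis(supervectors, n_eig):
    # supervectors: (n_triphones, dim) matrix of triphone mean supervectors
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # Principal directions via SVD; rows of Vt span the eigenspace
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_eig]                # the "eigentriphones"
    coords = centered @ basis.T       # each triphone as a distinct point
    return mean, basis, coords        # reconstruct: mean + coords @ basis
```

Because every triphone keeps its own coordinates, no two triphones need collapse onto the same model, which is how the approach avoids the quantization error of hard state tying.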
{"title":"Eigentriphones for Context-Dependent Acoustic Modeling","authors":"Tom Ko, B. Mak","doi":"10.1109/TASL.2013.2248722","DOIUrl":"https://doi.org/10.1109/TASL.2013.2248722","url":null,"abstract":"Most automatic speech recognizers employ tied-state triphone hidden Markov models (HMM), in which the corresponding triphone states of the same base phone are tied. State tying is commonly performed with the use of a phonetic regression class tree which renders robust context-dependent modeling possible by carefully balancing the amount of training data with the degree of tying. However, tying inevitably introduces quantization error: triphones tied to the same state are not distinguishable in that state. Recently we proposed a new triphone modeling approach called eigentriphone modeling in which all triphone models are, in general, distinct. The idea is to create an eigenbasis for each base phone (or phone state) and all its triphones (or triphone states) are represented as distinct points in the space spanned by the basis. We have shown that triphone HMMs trained using model-based or state-based eigentriphones perform at least as well as conventional tied-state HMMs. In this paper, we further generalize the definition of eigentriphones over clusters of acoustic units. Our experiments on TIMIT phone recognition and the Wall Street Journal 5K-vocabulary continuous speech recognition show that eigentriphones estimated from state clusters defined by the nodes in the same phonetic regression class tree used in state tying result in further performance gain.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2248722","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Compressed Sensing Approach to Blind Separation of Speech Mixture Based on a Two-Layer Sparsity Model
Guangzhao Bao, Z. Ye, Xu Xu, Yingyue Zhou
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2012.2234110
This paper discusses underdetermined blind source separation (BSS) using a compressed sensing (CS) approach comprising two stages. In the first stage, we exploit a modified K-means method to estimate the unknown mixing matrix. The second stage separates the sources from the mixed signals using the mixing matrix estimated in the first stage. In this stage, a two-layer sparsity model is used, which assumes that the low-frequency components of speech signals are sparse on a K-SVD dictionary and the high-frequency components are sparse on a discrete cosine transform (DCT) dictionary. This model, taking advantage of two dictionaries, can produce effective separation performance even if the sources are not sparse in the time-frequency (TF) domain.
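The first stage can be illustrated with a direction-clustering variant of K-means (a common choice for this task, though not necessarily the authors' exact modification): TF points are normalized to unit norm, folded onto a half-space to remove the sign ambiguity, and clustered by direction, with each cluster's dominant direction taken as a mixing-matrix column.

```python
# Sketch: mixing-matrix estimation by clustering TF-point directions.
import numpy as np

def estimate_mixing(tf_points, n_sources, n_iter=50, seed=0):
    # tf_points: (n_points, n_mics) mixture TF coefficients (real, for brevity)
    rng = np.random.default_rng(seed)
    X = tf_points / (np.linalg.norm(tf_points, axis=1, keepdims=True) + 1e-12)
    X = X * np.where(X[:, :1] < 0, -1.0, 1.0)  # fold onto one half-space
    A = X[rng.choice(len(X), n_sources, replace=False)].T  # init columns
    for _ in range(n_iter):
        labels = np.argmax(np.abs(X @ A), axis=1)  # assign nearest direction
        for k in range(n_sources):
            pts = X[labels == k]
            if len(pts):
                # Dominant direction of the cluster via SVD
                _, _, Vt = np.linalg.svd(pts, full_matrices=False)
                A[:, k] = Vt[0]
    return A  # estimated mixing matrix (columns up to sign and scale)
```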
{"title":"A Compressed Sensing Approach to Blind Separation of Speech Mixture Based on a Two-Layer Sparsity Model","authors":"Guangzhao Bao, Z. Ye, Xu Xu, Yingyue Zhou","doi":"10.1109/TASL.2012.2234110","DOIUrl":"https://doi.org/10.1109/TASL.2012.2234110","url":null,"abstract":"This paper discusses underdetermined blind source separation (BSS) using a compressed sensing (CS) approach, which contains two stages. In the first stage we exploit a modified K-means method to estimate the unknown mixing matrix. The second stage is to separate the sources from the mixed signals using the estimated mixing matrix from the first stage. In the second stage a two-layer sparsity model is used. The two-layer sparsity model assumes that the low frequency components of speech signals are sparse on K-SVD dictionary and the high frequency components are sparse on discrete cosine transformation (DCT) dictionary. This model, taking advantage of two dictionaries, can produce effective separation performance even if the sources are not sparse in time-frequency (TF) domain.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2012.2234110","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62884843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data
H. Sawada, H. Kameoka, S. Araki, N. Ueda
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2013.2239990
This paper presents new formulations and algorithms for multichannel extensions of non-negative matrix factorization (NMF). The formulations employ Hermitian positive semidefinite matrices to represent a multichannel version of non-negative elements. Multichannel Euclidean distance and multichannel Itakura-Saito (IS) divergence are defined based on appropriate statistical models utilizing multivariate complex Gaussian distributions. To minimize these distances/divergences, efficient optimization algorithms in the form of multiplicative updates are derived using properly designed auxiliary functions. Two methods are proposed for clustering NMF bases according to their estimated spatial properties. Convolutive blind source separation (BSS) is performed by the multichannel extensions of NMF equipped with this clustering mechanism. Experimental results show that 1) the derived multiplicative update rules exhibit good convergence behavior, and 2) BSS tasks involving several music sources, two microphones, and three instrumental parts are handled successfully.
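For orientation, the single-channel Itakura-Saito NMF with multiplicative updates, which the paper generalizes by replacing non-negative scalars with Hermitian positive semidefinite matrices, looks as follows (a standard baseline, not the multichannel algorithm itself).

```python
# Sketch: single-channel Itakura-Saito NMF with multiplicative updates.
import numpy as np

def is_nmf(V, n_basis, n_iter=200, seed=0):
    # V: (freq, time) power spectrogram, assumed strictly positive
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_basis)) + 0.1  # spectral basis vectors
    H = rng.random((n_basis, T)) + 0.1  # time-varying activations
    for _ in range(n_iter):
        Vh = W @ H
        # Standard IS-divergence multiplicative updates
        W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T)
        Vh = W @ H
        H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh))
    return W, H
```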
{"title":"Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data","authors":"H. Sawada, H. Kameoka, S. Araki, N. Ueda","doi":"10.1109/TASL.2013.2239990","DOIUrl":"https://doi.org/10.1109/TASL.2013.2239990","url":null,"abstract":"This paper presents new formulations and algorithms for multichannel extensions of non-negative matrix factorization (NMF). The formulations employ Hermitian positive semidefinite matrices to represent a multichannel version of non-negative elements. Multichannel Euclidean distance and multichannel Itakura-Saito (IS) divergence are defined based on appropriate statistical models utilizing multivariate complex Gaussian distributions. To minimize this distance/divergence, efficient optimization algorithms in the form of multiplicative updates are derived by using properly designed auxiliary functions. Two methods are proposed for clustering NMF bases according to the estimated spatial property. Convolutive blind source separation (BSS) is performed by the multichannel extensions of NMF with the clustering mechanism. Experimental results show that 1) the derived multiplicative update rules exhibited good convergence behavior, and 2) BSS tasks for several music sources with two microphones and three instrumental parts were evaluated successfully.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2239990","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62886114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Improving Statistical Machine Translation Using Bayesian Word Alignment and Gibbs Sampling
Coskun Mermer, M. Saraçlar, R. Sarikaya
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2013.2244087
We present a Bayesian approach to word alignment inference in IBM Models 1 and 2. In the original approach, word translation probabilities (i.e., model parameters) are estimated using the expectation-maximization (EM) algorithm. In the proposed approach, they are treated as random variables with a prior and are integrated out during inference. We use Gibbs sampling to infer the word alignment posteriors. The inferred word alignments are compared against EM and variational Bayes (VB) inference in terms of their end-to-end translation performance on several language pairs and corpus types of up to 15 million sentence pairs. We show that Bayesian inference outperforms both EM and VB in the majority of test cases. Further analysis reveals that the proposed method effectively addresses the high-fertility rare-word problem of EM and the unaligned rare-word problem of VB, achieves higher agreement and vocabulary coverage rates than both, and leads to smaller phrase tables.
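A collapsed Gibbs sampler for IBM Model 1 with a symmetric Dirichlet prior on the translation table can be sketched compactly. The version below is a minimal illustration (random initialization, the `alpha` and iteration-count defaults are illustrative, and Model 2's distortion term is omitted): each target word's alignment link is resampled from its posterior given all other links, with the translation table integrated out.

```python
# Sketch: collapsed Gibbs sampling for IBM Model 1 word alignment.
import numpy as np
from collections import defaultdict

def gibbs_ibm1(corpus, alpha=0.01, n_iter=100, seed=0):
    # corpus: list of (e, f) sentence pairs; e includes NULL at position 0
    rng = np.random.default_rng(seed)
    V = len({w for _, f in corpus for w in f})  # target vocabulary size
    pair, tot = defaultdict(float), defaultdict(float)
    align = []
    for e, f in corpus:  # random initialization of alignment links
        a = rng.integers(0, len(e), size=len(f))
        align.append(a)
        for j, i in enumerate(a):
            pair[(e[i], f[j])] += 1; tot[e[i]] += 1
    for _ in range(n_iter):
        for s, (e, f) in enumerate(corpus):
            for j, fw in enumerate(f):
                i_old = align[s][j]  # remove the current link from the counts
                pair[(e[i_old], fw)] -= 1; tot[e[i_old]] -= 1
                # Collapsed posterior over alignment positions:
                # P(a_j = i | rest) ~ (n(e_i, f_j) + alpha) / (n(e_i) + alpha*V)
                p = np.array([(pair[(ew, fw)] + alpha) / (tot[ew] + alpha * V)
                              for ew in e])
                i_new = rng.choice(len(e), p=p / p.sum())
                align[s][j] = i_new
                pair[(e[i_new], fw)] += 1; tot[e[i_new]] += 1
    return align, pair
```

The small `alpha` discourages large translation tables for rare words, which is one intuition for why the Bayesian treatment mitigates the high-fertility rare-word problem.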
{"title":"Improving Statistical Machine Translation Using Bayesian Word Alignment and Gibbs Sampling","authors":"Coskun Mermer, M. Saraçlar, R. Sarikaya","doi":"10.1109/TASL.2013.2244087","DOIUrl":"https://doi.org/10.1109/TASL.2013.2244087","url":null,"abstract":"We present a Bayesian approach to word alignment inference in IBM Models 1 and 2. In the original approach, word translation probabilities (i.e., model parameters) are estimated using the expectation-maximization (EM) algorithm. In the proposed approach, they are random variables with a prior and are integrated out during inference. We use Gibbs sampling to infer the word alignment posteriors. The inferred word alignments are compared against EM and variational Bayes (VB) inference in terms of their end-to-end translation performance on several language pairs and types of corpora up to 15 million sentence pairs. We show that Bayesian inference outperforms both EM and VB in the majority of test cases. Further analysis reveals that the proposed method effectively addresses the high-fertility rare word problem in EM and unaligned rare word problem in VB, achieves higher agreement and vocabulary coverage rates than both, and leads to smaller phrase tables.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2244087","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62887557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Nonnegative HMM for Babble Noise Derived From Speech HMM: Application to Speech Enhancement
N. Mohammadiha, A. Leijon
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2013.2243435
Deriving a good model for multitalker babble noise can facilitate various speech processing algorithms, e.g., noise reduction, in mitigating the so-called cocktail party problem. Existing systems do not explicitly exploit the fact that the babble waveform is generated as a sum of N different speech waveforms. In this paper, we first develop a gamma hidden Markov model (HMM) for the power spectra of the speech signal and formulate it as a sparse nonnegative matrix factorization (NMF). Second, the sparse NMF is extended by relaxing the sparsity constraint, and a novel model for babble noise (a gamma nonnegative HMM) is proposed, in which the babble basis matrix is identical to the speech basis matrix and only the activation factors (weights) of the basis vectors differ between the two signals over time. Finally, a noise reduction algorithm is proposed using the derived speech and babble models. All stationary model parameters are estimated using the expectation-maximization (EM) algorithm, whereas the time-varying parameters, i.e., the gain parameters of the speech and babble signals, are estimated using a recursive EM algorithm. Objective and subjective listening evaluations show that the proposed babble model and the resulting noise reduction algorithm significantly outperform conventional methods.
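Once the speech and babble power spectra are estimated by the models, the enhancement itself reduces to a Wiener-type gain applied to the noisy STFT. A minimal sketch of that final step (the gain floor is an illustrative choice, and the model-based PSD estimation is assumed to happen upstream):

```python
# Sketch: Wiener-type gain from estimated speech and babble power spectra.
import numpy as np

def enhance(noisy_stft, speech_psd, babble_psd, floor=0.1):
    # Per time-frequency bin gain: speech power over total power
    gain = speech_psd / (speech_psd + babble_psd + 1e-12)
    # Flooring the gain limits musical-noise artifacts
    return np.maximum(gain, floor) * noisy_stft
```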
{"title":"Nonnegative HMM for Babble Noise Derived From Speech HMM: Application to Speech Enhancement","authors":"N. Mohammadiha, A. Leijon","doi":"10.1109/TASL.2013.2243435","DOIUrl":"https://doi.org/10.1109/TASL.2013.2243435","url":null,"abstract":"Deriving a good model for multitalker babble noise can facilitate different speech processing algorithms, e.g., noise reduction, to reduce the so-called cocktail party difficulty. In the available systems, the fact that the babble waveform is generated as a sum of N different speech waveforms is not exploited explicitly. In this paper, first we develop a gamma hidden Markov model for power spectra of the speech signal, and then formulate it as a sparse nonnegative matrix factorization (NMF). Second, the sparse NMF is extended by relaxing the sparsity constraint, and a novel model for babble noise (gamma nonnegative HMM) is proposed in which the babble basis matrix is the same as the speech basis matrix, and only the activation factors (weights) of the basis vectors are different for the two signals over time. Finally, a noise reduction algorithm is proposed using the derived speech and babble models. All of the stationary model parameters are estimated using the expectation-maximization (EM) algorithm, whereas the time-varying parameters, i.e., the gain parameters of speech and babble signals, are estimated using a recursive EM algorithm. The objective and subjective listening evaluations show that the proposed babble model and the final noise reduction algorithm significantly outperform the conventional methods.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2243435","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62885800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Nonlinear Least Squares Methods for Joint DOA and Pitch Estimation
J. Jensen, M. G. Christensen, S. H. Jensen
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2013.2239290
In this paper, we consider the problem of joint direction-of-arrival (DOA) and fundamental frequency estimation. Joint estimation enables robust estimation of these parameters in multi-source scenarios where separate estimators may fail. First, we derive the exact and asymptotic Cramér-Rao bounds for the joint estimation problem. Then, we propose a nonlinear least squares (NLS) estimator and an approximate NLS (aNLS) estimator for joint DOA and fundamental frequency estimation. The proposed estimators are maximum likelihood estimators when: 1) the noise is white Gaussian, 2) the environment is anechoic, and 3) the source of interest is in the far field. Otherwise, the methods still yield approximate maximum likelihood estimates. Simulations on synthetic data show that the proposed methods perform similarly to or better than state-of-the-art methods for DOA and fundamental frequency estimation. Moreover, simulations on real-life data indicate that the NLS and aNLS methods remain applicable even when reverberation is present and the noise is not white Gaussian.
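The flavor of the approximate estimator can be conveyed by a brute-force grid search that harmonically sums a delay-and-sum statistic over a DOA-by-fundamental-frequency grid. The sketch below assumes a far-field source, free-field propagation, and a linear array, and is meant as an illustration rather than the paper's efficient formulation; `n_harm` and the grids are user-chosen assumptions.

```python
# Sketch: joint DOA / f0 estimation via a harmonically summed
# delay-and-sum statistic on a 2-D grid (far field, linear array).
import numpy as np

def joint_doa_f0(X, fs, mic_pos, theta_grid, f0_grid, n_harm=5, c=343.0):
    # X: (n_mics, n_samples) array signals; mic_pos: mic x-positions [m];
    # theta_grid in radians, f0_grid in Hz
    n = np.arange(X.shape[1])
    best, est = -np.inf, (None, None)
    for theta in theta_grid:
        tau = mic_pos * np.sin(theta) / c  # far-field inter-mic delays [s]
        for f0 in f0_grid:
            J = 0.0
            for l in range(1, n_harm + 1):
                w = 2 * np.pi * f0 * l  # harmonic frequency [rad/s]
                # Phase-align the mics at this harmonic and sum (delay-and-sum)
                Y = sum(np.exp(1j * w * tau[m]) *
                        (X[m] @ np.exp(-1j * (w / fs) * n))
                        for m in range(X.shape[0]))
                J += abs(Y) ** 2  # harmonic summation across partials
            if J > best:
                best, est = J, (theta, f0)
    return est  # (DOA, f0) maximizing the joint statistic
```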
{"title":"Nonlinear Least Squares Methods for Joint DOA and Pitch Estimation","authors":"J. Jensen, M. G. Christensen, S. H. Jensen","doi":"10.1109/TASL.2013.2239290","DOIUrl":"https://doi.org/10.1109/TASL.2013.2239290","url":null,"abstract":"In this paper, we consider the problem of joint direction-of-arrival (DOA) and fundamental frequency estimation. Joint estimation enables robust estimation of these parameters in multi-source scenarios where separate estimators may fail. First, we derive the exact and asymptotic Cramér-Rao bounds for the joint estimation problem. Then, we propose a nonlinear least squares (NLS) and an approximate NLS (aNLS) estimator for joint DOA and fundamental frequency estimation. The proposed estimators are maximum likelihood estimators when: 1) the noise is white Gaussian, 2) the environment is anechoic, and 3) the source of interest is in the far-field. Otherwise, the methods still approximately yield maximum likelihood estimates. Simulations on synthetic data show that the proposed methods have similar or better performance than state-of-the-art methods for DOA and fundamental frequency estimation. Moreover, simulations on real-life data indicate that the NLS and aNLS methods are applicable even when reverberation is present and the noise is not white Gaussian.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2239290","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62885905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Computing MMSE Estimates and Residual Uncertainty Directly in the Feature Domain of ASR using STFT Domain Speech Distortion Models
Ramón Fernández Astudillo, R. Orglmeister
Pub Date: 2013-05-01 | DOI: 10.1109/TASL.2013.2244085
In this paper, we demonstrate how uncertainty propagation allows the computation of minimum mean square error (MMSE) estimates in the feature domain for various feature extraction methods, using short-time Fourier transform (STFT) domain distortion models. In addition, a measure of estimate reliability is obtained, which allows either feature re-estimation or dynamic compensation of automatic speech recognition (ASR) models. The proposed method transforms the posterior distribution associated with a Wiener filter through the feature extraction using the STFT uncertainty propagation formulas. It is also shown that non-linear estimators in the STFT domain, such as the Ephraim-Malah filters, can be seen as special cases of a propagation of the Wiener posterior. The method is illustrated by developing two MMSE Mel-frequency cepstral coefficient (MFCC) estimators and combining them with observation uncertainty techniques. We discuss similarities with other MMSE-MFCC estimators and show that the proposed approach outperforms conventional MMSE estimators in the STFT domain on the AURORA4 robust ASR task.
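One propagation step is easy to make concrete: if the posterior in the log-Mel domain is (assumed) Gaussian, the linear DCT maps its mean and covariance exactly into the cepstral domain. A small sketch of that step, with the earlier non-linear propagation stages assumed to have produced the log-Mel mean and variance:

```python
# Sketch: exact propagation of a Gaussian log-Mel posterior through the DCT.
import numpy as np
from scipy.fft import dct

def propagate_to_mfcc(logmel_mean, logmel_var, n_ceps=13):
    # Orthonormal DCT-II matrix acting on the Mel axis
    C = dct(np.eye(len(logmel_mean)), type=2, norm="ortho", axis=0)[:n_ceps]
    mfcc_mean = C @ logmel_mean                     # posterior mean of MFCCs
    mfcc_cov = C @ np.diag(logmel_var) @ C.T        # exact for a linear map
    return mfcc_mean, mfcc_cov  # mean = MMSE estimate, cov = its reliability
```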
{"title":"Computing MMSE Estimates and Residual Uncertainty Directly in the Feature Domain of ASR using STFT Domain Speech Distortion Models","authors":"Ramón Fernández Astudillo, R. Orglmeister","doi":"10.1109/TASL.2013.2244085","DOIUrl":"https://doi.org/10.1109/TASL.2013.2244085","url":null,"abstract":"In this paper we demonstrate how uncertainty propagation allows the computation of minimum mean square error (MMSE) estimates in the feature domain for various feature extraction methods using short-time Fourier transform (STFT) domain distortion models. In addition to this, a measure of estimate reliability is also attained which allows either feature re-estimation or the dynamic compensation of automatic speech recognition (ASR) models. The proposed method transforms the posterior distribution associated to a Wiener filter through the feature extraction using the STFT Uncertainty Propagation formulas. It is also shown that non-linear estimators in the STFT domain like the Ephraim-Malah filters can be seen as special cases of a propagation of the Wiener posterior. The method is illustrated by developing two MMSE-Mel-frequency Cepstral Coefficient (MFCC) estimators and combining them with observation uncertainty techniques. We discuss similarities with other MMSE-MFCC estimators and show how the proposed approach outperforms conventional MMSE estimators in the STFT domain on the AURORA4 robust ASR task.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2244085","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62886263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}