IEEE Transactions on Audio Speech and Language Processing: Latest Articles, Page 5

Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2248720

D. Giannoulis, Anssi Klapuri

A method is described for musical instrument recognition in polyphonic audio signals where several sound sources are active at the same time. The proposed method is based on local spectral features and missing-feature techniques. A novel mask estimation algorithm is described that identifies spectral regions that contain reliable information for each sound source, and bounded marginalization is then used to treat the feature vector elements that are determined to be unreliable. The mask estimation technique is based on the assumption that the spectral envelopes of musical sounds tend to be slowly-varying as a function of log-frequency and unreliable spectral components can therefore be detected as positive deviations from an estimated smooth spectral envelope. A computationally efficient algorithm is proposed for marginalizing the mask in the classification process. In simulations, the proposed method clearly outperforms reference methods for mixture signals. The proposed mask estimation technique leads to a recognition accuracy that is approximately half-way between a trivial all-one mask (all features are assumed reliable) and an ideal “oracle” mask.
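
The bounded-marginalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each instrument class is modelled by a single diagonal Gaussian over the spectral features and that an unreliable feature's observed value is an upper bound on the target's true value; the abstract does not specify the class model or the bounds, and the names `bounded_marginal_loglik`, `means`, and `stds` are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def bounded_marginal_loglik(features, mask, means, stds, floor=1e-300):
    """Log-likelihood of one feature vector under a diagonal Gaussian.

    Reliable elements (mask True) contribute their ordinary density.
    Unreliable elements are integrated from -inf up to the observed value,
    i.e. the observation is treated only as an upper bound (an assumption).
    """
    total = 0.0
    for x, reliable, mean, std in zip(features, mask, means, stds):
        if reliable:
            total += gaussian_logpdf(x, mean, std)
        else:
            total += math.log(max(gaussian_cdf(x, mean, std), floor))
    return total


def classify(features, mask, class_models):
    """Pick the class with the highest bounded-marginal log-likelihood.

    class_models maps a class name to a (means, stds) pair (hypothetical).
    """
    return max(
        class_models,
        key=lambda name: bounded_marginal_loglik(features, mask, *class_models[name]),
    )
```

With an all-one mask this reduces to ordinary Gaussian classification, which corresponds to the trivial baseline named in the abstract.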

Citations: 32

Characterization of Multiple Transient Acoustical Sources From Time-Transform Representations

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2263141

N. Wachowski, M. Azimi-Sadjadi

This paper introduces a new framework for detecting, classifying, and estimating the signatures of multiple transient acoustical sources from a time-transform representation (TTR) of an audio waveform. A TTR is a vector observation sequence containing the coefficients of consecutive windows of data with respect to known sampled basis waveforms. A set of likelihood ratio tests is hierarchically applied to each time slice of a TTR to detect and classify signals in the presence of interference. Since the signatures of each acoustical event typically span several adjacent dependent observations, a Kalman filter is used to generate the parameters necessary for computing the likelihood values. The experimental results of applying the proposed method to a problem of detecting and classifying man-made and natural transient acoustical events in national park soundscape recordings attest to its effectiveness at performing the aforementioned tasks.

Citations: 3

A Multichannel MMSE-Based Framework for Speech Source Separation and Noise Reduction

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2263137

M. Souden, S. Araki, K. Kinoshita, T. Nakatani, H. Sawada

We propose a new framework for joint multichannel speech source separation and acoustic noise reduction. In this framework, we start by formulating the minimum-mean-square error (MMSE)-based solution in the context of multiple simultaneous speakers and background noise, and outline the importance of the estimation of the activities of the speakers. The latter is accurately achieved by introducing a latent variable that takes N+1 possible discrete states for a mixture of N speech signals plus additive noise. Each state characterizes the dominance of one of the N+1 signals. We determine the posterior probability of this latent variable, and show how it plays a twofold role in the MMSE-based speech enhancement. First, it allows the extraction of the second order statistics of the noise and each of the speech signals from the noisy data. These statistics are needed to formulate the multichannel Wiener-based filters (including the minimum variance distortionless response). Second, it weighs the outputs of these linear filters to shape the spectral contents of the signals' estimates following the associated target speakers' activities. We use the spatial and spectral cues contained in the multichannel recordings of the sound mixtures to compute the posterior probability of this latent variable. The spatial cue is acquired by using the normalized observation vector whose distribution is well approximated by a Gaussian-mixture-like model, while the spectral cue can be captured by using a pre-trained Gaussian mixture model for the log-spectra of speech. The parameters of the investigated models and the speakers' activities (posterior probabilities of the different states of the latent variable) are estimated via expectation maximization. Experimental results including comparisons with the well-known independent component analysis and masking are provided to demonstrate the efficiency of the proposed framework.
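
The twofold role of the latent-variable posterior can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: the posterior of each state weights the observations when accumulating that signal's second-order statistics, and it also scales the corresponding linear filter output. The function names and the per-bin data layout are assumptions.

```python
def weighted_covariance(observations, posteriors):
    """Posterior-weighted spatial covariance for one signal in one frequency bin.

    observations: list of multichannel observation vectors (lists of complex).
    posteriors:   per-frame posterior probability that this signal dominates.
    Returns sum_t p(t) x(t) x(t)^H / sum_t p(t); the posteriors must not all be zero.
    """
    channels = len(observations[0])
    cov = [[0j] * channels for _ in range(channels)]
    total = sum(posteriors)
    for x, p in zip(observations, posteriors):
        for i in range(channels):
            for j in range(channels):
                cov[i][j] += p * x[i] * x[j].conjugate()
    return [[value / total for value in row] for row in cov]


def weighted_output(posterior, filter_output):
    """Scale a linear filter output by the posterior activity of its target."""
    return posterior * filter_output
```

In the paper the posteriors themselves are estimated via expectation maximization from spatial and spectral cues; here they are simply taken as given.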

Citations: 101

Room Impulse Response Synthesis and Validation Using a Hybrid Acoustic Model

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2263139

A. Southern, S. Siltanen, D. Murphy, L. Savioja

Synthesizing the room impulse response (RIR) of an arbitrary enclosure may be performed using a number of alternative acoustic modeling methods, each with their own particular advantages and limitations. This article is concerned with obtaining a hybrid RIR derived from both wave and geometric-acoustics based methods, optimized for use across different regions of time or frequency. Consideration is given to how such RIRs can be matched across modeling domains in terms of both amplitude and boundary behavior and the approach is verified using a number of standardised case studies.

Citations: 43

Vector Quantization of LSF Parameters With a Mixture of Dirichlet Distributions

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2238732

Zhanyu Ma, A. Leijon, W. Kleijn

Quantization of the linear predictive coding parameters is an important part of speech coding. Probability density function (PDF)-optimized vector quantization (VQ) has previously been shown to be more efficient than VQ based only on training data. For data with bounded support, some well-defined bounded-support distributions (e.g., the Dirichlet distribution) have been proven to outperform the conventional Gaussian mixture model (GMM) with the same number of free parameters required to describe the model. When exploiting both the boundary and the order properties of the line spectral frequency (LSF) parameters, the distribution of the LSF differences (ΔLSF) can be modelled with a Dirichlet mixture model (DMM). We propose a corresponding DMM-based VQ (DVQ). The elements in a Dirichlet vector variable are highly mutually correlated. Motivated by the Dirichlet vector variable's neutrality property, a practical non-linear transformation scheme for the Dirichlet vector variable can be obtained. Similar to the Karhunen-Loève transform for Gaussian variables, this non-linear transformation decomposes the Dirichlet vector variable into a set of independent beta-distributed variables. Using high-rate quantization theory and the entropy constraint, the optimal inter- and intra-component bit allocation strategies are proposed. In the implementation of scalar quantizers, we use constrained-resolution coding to approximate the derived constrained-entropy coding. A practical coding scheme for DVQ is designed for the purpose of reducing the quantization error accumulation. The theoretical and practical quantization performance of DVQ is evaluated. Compared to the state-of-the-art GMM-based VQ and the recently proposed beta mixture model (BMM) based VQ, DVQ performs better, with even fewer free parameters and lower computational cost.
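
The two transformations described in the abstract can be sketched as follows. The first maps ordered LSFs in (0, π) to normalized differences that are positive and sum to one, which is the support of a Dirichlet variable. The second is the standard sequential (stick-breaking) decomposition that follows from the neutrality property; for a Dirichlet vector it yields mutually independent beta-distributed fractions. The paper's exact transformation scheme may differ, and all function names here are hypothetical.

```python
import math


def lsf_to_delta(lsf):
    """Map K ordered LSFs in (0, pi) to K+1 positive differences summing to 1."""
    edges = [0.0] + list(lsf) + [math.pi]
    return [(upper - lower) / math.pi for lower, upper in zip(edges, edges[1:])]


def neutral_transform(x):
    """Sequential decomposition of a vector on the simplex.

    u[k] = x[k] / (1 - x[0] - ... - x[k-1]); for Dirichlet-distributed x the
    u[k] are independent beta variables (neutrality property).
    """
    fractions = []
    remaining = 1.0
    for value in x[:-1]:
        fractions.append(value / remaining)
        remaining -= value
    return fractions


def inverse_neutral_transform(fractions):
    """Rebuild the simplex vector from the sequential fractions."""
    x = []
    remaining = 1.0
    for frac in fractions:
        piece = frac * remaining
        x.append(piece)
        remaining -= piece
    x.append(remaining)
    return x
```

After the decomposition, each fraction could be quantized by its own scalar quantizer, which is where the bit allocation discussed in the abstract would apply.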

Citations: 55

Harmonic Adaptive Latent Component Analysis of Audio and Application to Music Transcription

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2260741

Benoit Fuentes, R. Badeau, G. Richard

Recently, new methods for smart decomposition of time-frequency representations of audio have been proposed in order to address the problem of automatic music transcription. However, those techniques are not necessarily suitable for notes having variations of both pitch and spectral envelope over time. The HALCA (Harmonic Adaptive Latent Component Analysis) model presented in this article allows those two kinds of variations to be considered simultaneously. Each note in a constant-Q transform is locally modeled as a weighted sum of fixed narrowband harmonic spectra, spectrally convolved with some impulse that defines the pitch. All parameters are estimated by means of the expectation-maximization (EM) algorithm, in the framework of Probabilistic Latent Component Analysis. Interesting priors over the parameters are also introduced in order to help the EM algorithm converge towards a meaningful solution. We applied this model to automatic music transcription: the onset time, duration and pitch of each note in an audio file are inferred from the estimated parameters. The system has been evaluated on two different databases and obtains very promising results.

Citations: 41

Regularization for Partial Multichannel Equalization for Speech Dereverberation

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2260743

I. Kodrasi, Stefan Goetze, S. Doclo

Acoustic multichannel equalization techniques such as the multiple-input/output inverse theorem (MINT), which aim to equalize the room impulse responses (RIRs) between the source and the microphone array, are known to be highly sensitive to RIR estimation errors. To increase robustness, it has been proposed to incorporate regularization in order to decrease the energy of the equalization filters. In addition, more robust partial multichannel equalization techniques such as relaxed multichannel least-squares (RMCLS) and channel shortening (CS) have recently been proposed. In this paper, we propose a partial multichannel equalization technique based on MINT (P-MINT) which aims to shorten the RIR. Furthermore, we investigate the effectiveness of incorporating regularization to further increase the robustness of P-MINT and the aforementioned partial multichannel equalization techniques, i.e., RMCLS and CS. In addition, we introduce an automatic non-intrusive procedure for determining the regularization parameter based on the L-curve. Simulation results using measured RIRs show that incorporating regularization in P-MINT yields a significant performance improvement in the presence of RIR estimation errors, whereas a smaller performance improvement is observed when incorporating regularization in RMCLS and CS. Furthermore, it is shown that the intrusively regularized P-MINT technique outperforms all other investigated intrusively regularized multichannel equalization techniques in terms of perceptual speech quality (PESQ). Finally, it is shown that the automatic non-intrusive regularization parameter in regularized P-MINT leads to performance very similar to that of the intrusively determined optimal regularization parameter, making regularized P-MINT a robust, perceptually advantageous, and practically applicable multichannel equalization technique for speech dereverberation.
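
The effect of regularization on the equalization filters can be illustrated with a generic regularized least-squares sketch. It stacks the per-channel convolution matrices into H and solves (HᵀH + δI) g = Hᵀd for the stacked filters g. The target d is an assumption supplied by the caller: a unit impulse for MINT-style full equalization, or the early part of one RIR for a partial target. This is not the paper's exact formulation, the L-curve parameter selection is not shown, and all names are hypothetical.

```python
def convolution_matrix(h, filter_len):
    """Matrix C such that C @ g equals the linear convolution of h and g."""
    rows = len(h) + filter_len - 1
    return [
        [h[r - c] if 0 <= r - c < len(h) else 0.0 for c in range(filter_len)]
        for r in range(rows)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def regularized_equalizer(rirs, filter_len, target, delta):
    """Stacked equalization filters minimizing ||H g - d||^2 + delta ||g||^2.

    rirs:   equal-length impulse responses, one per microphone.
    target: desired equalized response, of length len(rir) + filter_len - 1.
    """
    blocks = [convolution_matrix(h, filter_len) for h in rirs]
    H = [sum((block[r] for block in blocks), []) for r in range(len(target))]
    n = len(H[0])
    gram = [
        [sum(row[i] * row[j] for row in H) + (delta if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    rhs = [sum(row[i] * t for row, t in zip(H, target)) for i in range(n)]
    return solve(gram, rhs)
```

Increasing `delta` lowers the filter energy at the price of a less exact fit to the target, which is the trade-off that the regularization parameter controls.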

Citations: 67

Automatic Accent Assessment Using Phonetic Mismatch and Human Perception

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2258011

F. William, A. Sangwan, J. Hansen

In this study, a new algorithm for automatic accent evaluation of native and non-native speakers is presented. The proposed system consists of two main steps: alignment and scoring. In the alignment step, the speech utterance is processed using a Weighted Finite State Transducer (WFST) based technique to automatically estimate the pronunciation mismatches (substitutions, deletions, and insertions). Subsequently, in the scoring step, two scoring systems that use the pronunciation mismatches from the alignment phase are proposed: (i) a WFST-scoring system to measure the degree of accentedness on a scale from -1 (non-native like) to +1 (native like), and (ii) a Maximum Entropy (ME) based technique to assign perceptually motivated scores to pronunciation mismatches. The accent scores provided by the WFST-scoring system and the ME scoring system are termed the WFST and P-WFST (perceptual WFST) accent scores, respectively. The proposed systems are evaluated on American English (AE) spoken by native and non-native (native speakers of Mandarin Chinese) speakers from the CU-Accent corpus. A listener evaluation with 50 native American English (N-AE) listeners was employed to assist in validating the performance of the proposed accent assessment systems. The proposed P-WFST algorithm shows higher and more consistent correlation with human-evaluated accent scores than the Goodness Of Pronunciation (GOP) measure. The proposed solution for accent classification and assessment based on WFST and P-WFST scores shows that an effective advancement is possible which correlates well with human perception.
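
The three mismatch types produced by the alignment step can be illustrated with a plain dynamic-programming (Levenshtein) alignment between a canonical phone sequence and a recognized one. This is a stand-in for the WFST-based alignment of the paper, not a reproduction of it, and the phone labels and function name are hypothetical. The scoring stage is not sketched because the abstract does not define the score computation.

```python
def count_mismatches(reference, hypothesis):
    """Count substitutions, deletions and insertions along one optimal alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            differs = reference[i - 1] != hypothesis[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + differs,  # match or substitution
                cost[i - 1][j] + 1,            # deletion of a reference phone
                cost[i][j - 1] + 1,            # insertion of an extra phone
            )
    # Trace back from the end to label each edit on one optimal path.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and reference[i - 1] != hypothesis[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + differs:
            counts["substitutions"] += int(differs)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts
```

When several alignments share the minimum cost, the traceback prefers substitutions, then deletions, then insertions; a weighted transducer would instead resolve such ties through its arc weights.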

Citations: 12

Nonlinear Acoustic Echo Cancellation Based on a Sliding-Window Leaky Kernel Affine Projection Algorithm

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2260742

Jose Manuel Gil-Cacho, M. Signoretto, T. Waterschoot, M. Moonen, S. H. Jensen

Acoustic echo cancellation (AEC) is used in speech communication systems where the existence of echoes degrades the speech intelligibility. Standard approaches to AEC rely on the assumption that the echo path to be identified can be modeled by a linear filter. However, some elements introduce nonlinear distortion and must be modeled as nonlinear systems. Several nonlinear models have been used with varying degrees of success. The kernel affine projection algorithm (KAPA) has been successfully applied to many areas in signal processing but not yet to nonlinear AEC (NLAEC). The contribution of this paper is three-fold: (1) to apply KAPA to the NLAEC problem, (2) to develop a sliding-window leaky KAPA (SWL-KAPA) that is well suited for NLAEC applications, and (3) to propose a kernel function consisting of a weighted sum of a linear and a Gaussian kernel. In our experimental setup, the proposed SWL-KAPA for NLAEC consistently outperforms the linear APA, resulting in up to 12 dB of improvement in ERLE at a computational cost that is only 4.6 times higher. Moreover, it is shown that the SWL-KAPA outperforms, by 4-6 dB, a Volterra-based NLAEC, which itself has a computational cost 413 times higher than that of the linear APA.
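
The third contribution, a kernel formed as a weighted sum of a linear and a Gaussian kernel, can be written down directly. The sketch below assumes a convex weighting with a mixing weight `alpha` and a Gaussian width `sigma`; the abstract does not state the exact parameterization, so both parameters and the function names are hypothetical.

```python
import math


def linear_gaussian_kernel(x, y, alpha=0.5, sigma=1.0):
    """Weighted sum of a linear kernel and a Gaussian kernel.

    alpha in [0, 1] weights the linear part (assumed convex combination);
    sigma is the width of the Gaussian part.
    """
    linear = sum(a * b for a, b in zip(x, y))
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    gaussian = math.exp(-squared_distance / (2.0 * sigma ** 2))
    return alpha * linear + (1.0 - alpha) * gaussian


def kernel_filter_output(dictionary, coefficients, x, alpha=0.5, sigma=1.0):
    """Output of a kernel adaptive filter: sum_i a_i * k(d_i, x).

    dictionary holds the stored input vectors of the sliding window and
    coefficients the corresponding expansion weights (both assumed given).
    """
    return sum(
        a * linear_gaussian_kernel(d, x, alpha, sigma)
        for d, a in zip(dictionary, coefficients)
    )
```

Setting `alpha` to 1 recovers a purely linear model, so the linear part can capture the linear echo path while the Gaussian part models the nonlinear residual.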



Citations: 58

Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260155

Eleftheria Georganti, T. May, S. Par, J. Mourjopoulos

A novel method for estimating the distance of a sound source from binaural speech signals is proposed. The method relies on several statistical features extracted from such signals and their binaural cues. First, the standard deviation of the difference of the magnitude spectra of the left and right binaural signals is used as a feature. In addition, an extended set of statistical features that can improve distance detection is extracted from an auditory front-end which models the peripheral processing of the human auditory system. The method incorporates the above features into two classification frameworks based on Gaussian mixture models and support vector machines, and the relative merits of those frameworks are evaluated. The proposed method achieves distance detection when tested in various acoustical environments and performs well in unknown environments. Its performance is also compared to an existing binaural distance detection method.
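The first feature mentioned in the abstract, the standard deviation of the difference between the left and right magnitude spectra, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the frame length, windowing, or magnitude scale, so the single unwindowed frame, the use of dB magnitudes, the `eps` floor, and the function names are all hypothetical choices.

```python
import cmath
import math
from statistics import pstdev
from typing import List, Sequence


def magnitude_spectrum_db(frame: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Magnitude spectrum in dB for bins 0..N/2, via a naive DFT.

    The dB scale and the eps floor (to avoid log of zero) are assumptions.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(20.0 * math.log10(abs(acc) + eps))
    return spectrum


def spectral_difference_std(left: Sequence[float], right: Sequence[float]) -> float:
    """Standard deviation across frequency of the left/right spectral difference."""
    if len(left) != len(right):
        raise ValueError("left and right frames must have the same length")
    left_db = magnitude_spectrum_db(left)
    right_db = magnitude_spectrum_db(right)
    difference = [l - r for l, r in zip(left_db, right_db)]
    return pstdev(difference)
```

A pure level difference between the ears gives a constant spectral difference and therefore a standard deviation near zero, whereas frequency-dependent differences between the two ear signals make the value grow.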


Citations: 36