IEEE Transactions on Audio Speech and Language Processing: Latest Articles

Cross Pattern Coherence Algorithm for Spatial Filtering Applications Utilizing Microphone Arrays

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2277928

Symeon Delikaris-Manias, V. Pulkki

A parametric spatial filtering algorithm with a fixed beam direction is proposed in this paper. The algorithm utilizes the normalized cross-spectral density between signals from microphones of different orders as a criterion for focusing in specific directions. The correlation between microphone signals is estimated in the time-frequency domain. A post-filter is calculated from a multichannel input and is used to assign attenuation values to a coincidentally captured audio signal. The proposed algorithm is simple to implement and can cope with interfering sources at different azimuthal locations, with or without the presence of diffuse sound. It is implemented using directional microphones that are oriented in the same look direction and have the same magnitude and phase response. Experiments are conducted with simulated and real microphone arrays employing the proposed post-filter, and the results are compared to previous coherence-based approaches, such as the McCowan post-filter. A significant improvement is demonstrated in terms of objective quality measures. Formal listening tests, conducted to assess the audibility of artifacts of the proposed algorithm in real acoustical scenarios, show that no annoying artifacts are present for certain spectral floor values. Examples of the proposed algorithm can be found online at http://www.acoustics.hut.fi/projects/cropac/soundExamples.

Pages: 2356-2367

Citations: 27
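
The post-filter idea in the abstract above can be illustrated with a minimal sketch. It assumes two time-aligned STFT signals from coincident microphones of different orders, uses an instantaneous (unsmoothed) normalized cross-spectrum, and applies half-wave rectification plus a spectral floor. The function name, the normalization, and the default floor value are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cross_pattern_postfilter(X1, X2, floor=0.1, eps=1e-12):
    """Hypothetical post-filter gain from two time-frequency signals.

    X1, X2: complex STFT arrays of identical shape (frames, bins),
            assumed to come from microphones of different orders.
    floor:  spectral floor, the lowest gain the filter may assign.
    """
    # Real part of the cross-spectrum, normalized by the mean power.
    cross = np.real(np.conj(X1) * X2)
    power = np.abs(X1) ** 2 + np.abs(X2) ** 2
    gain = 2.0 * cross / (power + eps)
    # Keep only positively correlated content (half-wave rectification).
    gain = np.maximum(gain, 0.0)
    # Bound the attenuation with the spectral floor.
    return np.clip(gain, floor, 1.0)
```

The gain would then multiply a separately captured signal bin by bin, for example `Y = cross_pattern_postfilter(X1, X2) * S`, where `S` is a hypothetical STFT of that signal.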

Passive Temporal Offset Estimation of Multichannel Recordings of an Ad-Hoc Microphone Array

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASLP.2013.2286921

Pasi Pertilä, M. Hämäläinen, Mikael Mieskolainen

In recent years, ad-hoc microphone arrays have become ubiquitous, and the capture hardware and quality are increasingly sophisticated. Ad-hoc arrays hold vast potential for audio applications, but they are inherently asynchronous, i.e., a temporal offset exists in each channel, and furthermore the device locations are generally unknown. Therefore, the data is not directly suitable for traditional microphone array applications such as source localization and beamforming. This work presents a least squares method for temporal offset estimation of a static ad-hoc microphone array. The method utilizes the captured audio content without the need to emit calibration signals, provided that a sufficient number of sound sources surround the array during the recording. The Cramer-Rao lower bound of the estimator is given, and the effect of a limited number of surrounding sources on the solution accuracy is investigated. A practical implementation is then presented using non-linear filtering with automatic parameter adjustment. Simulations over a range of reverberation and noise levels demonstrate the algorithm's robustness. Using smartphones, an average RMS error of 3.5 samples (at 48 kHz) was reached when the algorithm's assumptions were met.

Pages: 2393-2402

Citations: 36
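
A generic least-squares step of the kind the abstract above mentions might look like the following sketch. It assumes that pairwise offset differences between channels have already been measured (how the paper obtains them from the audio is not reproduced here) and that channel 0 serves as the time reference. The function and argument names are hypothetical.

```python
import numpy as np

def estimate_offsets(pairs, measured, n_channels):
    """Least-squares channel offsets from pairwise offset differences.

    pairs:    list of (i, j) channel index pairs.
    measured: measured[k] approximates offset[i] - offset[j] for pairs[k].
    Channel 0 is taken as the reference, so its offset is fixed to 0.
    """
    A = np.zeros((len(pairs), n_channels))
    for k, (i, j) in enumerate(pairs):
        A[k, i] = 1.0
        A[k, j] = -1.0
    # Drop the reference column so the system has a unique solution.
    b = np.asarray(measured, dtype=float)
    sol, *_ = np.linalg.lstsq(A[:, 1:], b, rcond=None)
    return np.concatenate(([0.0], sol))
```

With more pairs than unknowns, the redundant measurements average out part of the measurement noise, which is the usual motivation for a least-squares formulation.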

Second Order Methods for Optimizing Convex Matrix Functions and Sparse Covariance Clustering

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2263142

Gillian M. Chin, J. Nocedal, P. Olsen, Steven J. Rennie

A variety of first-order methods have recently been proposed for solving matrix optimization problems arising in machine learning. The premise for utilizing such algorithms is that second-order information is too expensive to employ, and so simple first-order iterations are likely to be optimal. In this paper, we argue that second-order information is in fact efficiently accessible in many matrix optimization problems, and can be effectively incorporated into optimization algorithms. We begin by reviewing how certain Hessian operations can be conveniently represented in a wide class of matrix optimization problems, and provide the first proofs for these results. Next we consider a concrete problem, namely the minimization of the ℓ1 regularized Jeffreys divergence, and derive formulae for computing Hessians and Hessian vector products. This allows us to propose various second-order methods for solving the Jeffreys divergence problem. We present extensive numerical results illustrating the behavior of the algorithms and apply the methods to a speech recognition problem. We compress full covariance Gaussian mixture models utilized for acoustic models in automatic speech recognition. By discovering clusters of (sparse inverse) covariance matrices, we can compress the number of covariance parameters by a factor exceeding 200, while still outperforming the word error rate (WER) performance of a diagonal covariance model that has 20 times fewer covariance parameters than the original acoustic model.

Pages: 2244-2254

Citations: 3
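
For orientation, the Jeffreys divergence named in the abstract above is the symmetrized Kullback-Leibler divergence. The sketch below evaluates it for two zero-mean Gaussians and adds an ℓ1 penalty on a precision (inverse covariance) matrix. The zero-mean restriction, the choice to penalize all entries of the precision matrix, and the function names are assumptions; the paper's exact objective may differ.

```python
import numpy as np

def jeffreys_gaussian(S1, S2):
    """Jeffreys divergence KL(p||q) + KL(q||p) for zero-mean Gaussians
    with covariance matrices S1 and S2 (log-determinant terms cancel)."""
    d = S1.shape[0]
    a = np.trace(np.linalg.solve(S2, S1))  # tr(S2^{-1} S1)
    b = np.trace(np.linalg.solve(S1, S2))  # tr(S1^{-1} S2)
    return 0.5 * (a + b) - d

def l1_jeffreys_objective(P, S, lam):
    """Hypothetical objective: divergence between a target covariance S
    and the model covariance inv(P), plus an l1 penalty on P."""
    return jeffreys_gaussian(S, np.linalg.inv(P)) + lam * np.abs(P).sum()
```

The penalty weight `lam` trades fidelity against sparsity of the precision matrix: larger values push more entries of `P` toward zero at the cost of a larger divergence.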

Distributional Semantic Models for Affective Text Analysis

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2277931

Nikos Malandrakis, A. Potamianos, Elias Iosif, Shrikanth S. Narayanan

We present an affective text analysis model that can directly estimate and combine affective ratings of multi-word terms, with application to the problem of sentence polarity/semantic orientation detection. Starting from a hierarchical compositional method for generating sentence ratings, we expand the model by adding multi-word terms that can capture non-compositional semantics. The method operates similarly to a bigram language model, using bigram terms or backing off to unigrams based on a (degree of) compositionality criterion. The affective ratings for n-gram terms of different orders are estimated via a corpus-based method using distributional semantic similarity metrics between unseen words and a set of seed words. N-gram ratings are then combined into sentence ratings via simple algebraic formulas. The proposed framework produces state-of-the-art results for word-level tasks in English and German and for the sentence-level news headline classification task SemEval'07-Task14. The inclusion of bigram terms in the model provides a significant performance improvement, even if no term selection is applied.

Pages: 2379-2392

Citations: 69
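
The bigram-with-backoff behavior described in the abstract above can be sketched as follows. The rating dictionaries, the compositionality scores, the threshold, and the use of a plain average as the combination formula are all made up for illustration; the abstract does not specify them.

```python
def sentence_rating(tokens, unigram, bigram, compositionality, threshold=0.5):
    """Average term ratings, using a bigram rating when the bigram is
    rated as non-compositional and backing off to unigrams otherwise.

    unigram:          dict word -> affective rating (hypothetical data).
    bigram:           dict (word, word) -> affective rating.
    compositionality: dict (word, word) -> score in [0, 1]; low scores
                      mean the bigram's meaning is not compositional.
    """
    ratings = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        use_bigram = (
            len(pair) == 2
            and pair in bigram
            and compositionality.get(pair, 1.0) < threshold
        )
        if use_bigram:
            ratings.append(bigram[pair])
            i += 2  # the bigram consumes both words
        else:
            ratings.append(unigram.get(tokens[i], 0.0))
            i += 1  # back off to the single word
    return sum(ratings) / len(ratings) if ratings else 0.0
```

In this toy setup a phrase such as "dead end" would be rated as one unit when its compositionality score is low, rather than as the average of "dead" and "end".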

Introduction to the Special Section on Large-Scale Optimization for Audio, Speech, and Language Processing

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2283631

D. Kanevsky, Xiaodong He, G. Heigold, Haizhou Li, Stephen J. Wright

The six papers in this special section on large-scale optimization for Audio, Speech, and Language Processing are summarized here.

Citations: 0

Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2284378

Tara N. Sainath, Brian Kingsbury, H. Soltau, B. Ramabhadran

While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training these networks is slow. Even to date, the most common approach to training DNNs is stochastic gradient descent, run serially on one machine. Serial training, coupled with the large number of training parameters (i.e., 10-50 million) and speech data set sizes (i.e., 20-100 million training points), makes DNN training very slow for LVCSR tasks. In this work, we explore a variety of optimization techniques to improve DNN training speed. These include parallelization of the gradient computation during cross-entropy and sequence training, as well as reducing the number of parameters in the network using a low-rank matrix factorization. Applying the proposed optimization techniques, we show that DNN training can be sped up by a factor of 3 on a 50-hour English Broadcast News (BN) task with no loss in accuracy. Furthermore, using the proposed techniques, we are able to train DNNs on a 300-hr Switchboard (SWB) task and a 400-hr English BN task, showing relative improvements of 9-30% over a state-of-the-art GMM/HMM system while the DNN has fewer parameters than the GMM/HMM system.

Pages: 2267-2276

Citations: 61
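
The parameter saving from a low-rank matrix factorization, mentioned in the abstract above, can be illustrated with a small sketch. It factors an existing weight matrix with a truncated SVD purely for demonstration; the paper may instead train the two factors directly, and the layer sizes and rank used in the usage note are hypothetical.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Approximate W (m x n) by A (m x rank) @ B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # scale each kept column by its singular value
    B = Vt[:rank, :]
    return A, B

def param_counts(m, n, rank):
    """Return (full, factored) parameter counts for an m x n weight matrix."""
    return m * n, rank * (m + n)
```

For a hypothetical 1024 x 6000 output layer and rank 128, `param_counts(1024, 6000, 128)` gives 6,144,000 parameters for the full matrix against 899,072 for the two factors.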

A Difference of Convex Functions Approach to Large-Scale Log-Linear Model Estimation

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2271592

Theodoros Tsiligkaridis, E. Marcheret, V. Goel

We introduce a new class of parameter estimation methods for log-linear models. Our approach relies on the fact that minimizing a rational function of mixtures of exponentials is equivalent to minimizing a difference of convex functions. This allows us to construct convex auxiliary functions by applying the concave-convex procedure (CCCP). We consider a modification of CCCP where a proximal term is added (ProxCCCP), and extend it further by introducing an ℓ1 penalty. For solving the 'convex + ℓ1' auxiliary problem, we propose an approach called SeqGPSR that is based on sequential application of the GPSR procedure. We present a convergence analysis of the algorithms, including sufficient conditions for convergence to a critical point of the objective function. We propose an adaptive procedure for varying the strength of the proximal regularization term in each ProxCCCP iteration, and show that this procedure (AProxCCCP) is effective in practice and stable under some mild conditions. The CCCP procedure and proposed variants are applied to the task of optimizing the cross-entropy objective function for an audio frame classification problem. Class posteriors are modeled using log-linear models consisting of approximately 6 million parameters. Our results show that CCCP variants achieve a much better cross-entropy objective value as compared to direct optimization of the objective function by a first-order gradient-based approach, stochastic gradient descent or the L-BFGS procedure.

Pages: 2255-2266

Citations: 7
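
The concave-convex procedure named in the abstract above can be shown on a one-dimensional toy problem, unrelated to log-linear models. The function f(x) = x^4 - 2x^2 is a difference of the convex functions u(x) = x^4 and v(x) = 2x^2. Each CCCP step linearizes v at the current point and minimizes the resulting convex function, which here has a closed form. The toy objective and the function name are chosen for illustration, and the proximal and ℓ1 variants are not covered.

```python
import numpy as np

def cccp_quartic(x0, iters=50):
    """Plain CCCP on f(x) = x**4 - 2*x**2 with u = x**4 and v = 2*x**2.

    Each step minimizes u(x) - v'(x_k) * x, i.e. solves u'(x) = v'(x_k):
    4 * x**3 = 4 * x_k, so x = cbrt(x_k).
    """
    x = float(x0)
    for _ in range(iters):
        x = float(np.cbrt(x))
    return x
```

Starting from any positive point the iterates approach the minimizer x = 1, and from any negative point they approach x = -1; x = 0 is a critical point at which the iteration stays put.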

Optimization Algorithms and Applications for Speech and Language Processing

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2283777

Stephen J. Wright, D. Kanevsky, L. Deng, Xiaodong He, G. Heigold, Haizhou Li

Optimization techniques have been used for many years in the formulation and solution of computational problems arising in speech and language processing. Such techniques are found in the Baum-Welch, extended Baum-Welch (EBW), Rprop, and GIS algorithms, for example. Additionally, the use of regularization terms has been seen in other applications of sparse optimization. This paper outlines a range of problems in which optimization formulations and algorithms play a role, giving some additional details on certain application problems in machine translation, speaker/language recognition, and automatic speech recognition. Several approaches developed in the speech and language processing communities are described in a way that makes them more recognizable as optimization procedures. Our survey is not exhaustive and is complemented by other papers in this volume.

Citations: 28

Room Reverberation Reconstruction: Interpolation of the Early Part Using Compressed Sensing

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2273662

R. Mignot, L. Daudet, F. Ollivier

This paper deals with the interpolation of the Room Impulse Responses (RIRs) within a whole volume, from as few measurements as possible, and without knowledge of the geometry of the room. We focus on the early reflections of the RIRs, which have the key property of being sparse in the time domain: this can be exploited in a framework of model-based Compressed Sensing. Starting from a set of RIRs randomly sampled in the spatial domain of interest by a 3D microphone array, we propose a modified Matching Pursuit algorithm to estimate the position of a small set of virtual sources. Then, the reconstruction of the RIRs at interpolated positions is performed using a projection onto a basis of monopoles, which correspond to the estimated virtual sources. An extension of the proposed algorithm allows the interpolation of the positions of both source and receiver, using the acquisition of four different source positions. This approach is validated both by numerical examples and by experimental measurements using a 3D array with up to 120 microphones.

Citations: 50
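
As background for the abstract above, the textbook Matching Pursuit loop is sketched below on a generic dictionary with unit-norm columns. This is the plain greedy algorithm, not the paper's modified version, and it does not model virtual source positions or monopoles; the names are hypothetical.

```python
import numpy as np

def matching_pursuit(D, y, n_atoms):
    """Plain matching pursuit.

    D:       dictionary of shape (n_samples, n_dict) with unit-norm columns.
    y:       signal of shape (n_samples,).
    n_atoms: number of greedy iterations.
    Returns the coefficient vector and the final residual.
    """
    residual = np.asarray(y, dtype=float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual             # correlation with every atom
        k = int(np.argmax(np.abs(corr)))  # best-matching atom
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]     # remove its contribution
    return coeffs, residual
```

When the signal is a combination of a few atoms, a small number of iterations captures most of its energy, which is the sparsity property the abstract relies on for the early reflections.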

Diffused Sensing for Sharp Directive Beamforming

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2274695

K. Niwa, Yusuke Hioka, K. Furuya, Y. Haneda

We generalized our previously proposed diffused sensing for microphone array design, which achieves sharp directive beamforming, so that various filter design methods can be applied. For conventional microphone arrays, various filter design methods have been studied to narrow the directivity beam width. However, it is difficult to minimize the power of interference sources in the beamforming output (output interference power) over a broad frequency range, since the cross-correlation between transfer functions from sound sources to microphones increases at some frequencies. With diffused sensing, the cross-correlation is minimized by physically varying the transfer functions. We investigated how a microphone array should be designed in order to minimize the cross-correlation between transfer functions and found that placing the array in a diffuse acoustic field produces optimum results. Because the transfer functions are known a priori, this finding makes it possible to narrow the directivity beam width over a broad frequency range. In practice, this can be achieved by placing microphones inside a reflective enclosure, part of which is open to let sound waves enter. We conducted experiments using 24 microphones and confirmed that diffused sensing reduced the output interference power over a broad frequency range and narrowed the beam width.

Pages: 2346-2355

Citations: 13
