Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise
Petko N. Petkov, G. Henter, W. Kleijn
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244089
An effective measure of speech intelligibility is the probability of correct recognition of the transmitted message. We propose a speech pre-enhancement method based on matching the recognized text to the text of the original message. The selected criterion is accurately approximated by the probability of the correct transcription given an estimate of the noisy speech features. In the presence of environmental noise, and with a decrease in the signal-to-noise ratio, speech intelligibility declines. We implement a speech pre-enhancement system that optimizes the proposed criterion over the parameters of two distinct speech modification strategies under an energy-preservation constraint. The proposed method requires prior knowledge in the form of a transcription of the transmitted message and acoustic speech models from an automatic speech recognition system. Performance results from an open-set subjective intelligibility test indicate a significant improvement over natural speech and over a reference system that optimizes a perceptual-distortion-based objective intelligibility measure. The computational complexity of the approach permits use in online applications.
A Two-Stage Beamforming Approach for Noise Reduction and Dereverberation
Emanuël Habets, J. Benesty
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239292
In general, the signal-to-noise ratio and the signal-to-reverberation ratio of speech received by a microphone decrease as the distance between the talker and the microphone increases. Dereverberation and noise reduction algorithms are essential for many applications, such as videoconferencing, hearing aids, and automatic speech recognition, to improve the quality and intelligibility of the received desired speech that is corrupted by reverberation and noise. In the last decade, researchers have aimed at estimating the reverberant desired speech signal as received by one of the microphones. Although this approach has led to practical noise reduction algorithms, the spatial diversity of the received desired signal is not exploited to dereverberate the speech signal. In this paper, a two-stage beamforming approach is presented for dereverberation and noise reduction. In the first stage, a signal-independent beamformer generates a reference signal that contains a dereverberated version of the desired speech signal as received at the microphones, plus residual noise. In the second stage, the filtered microphone signals and the noisy reference signal are used to obtain an estimate of the dereverberated desired speech signal. In this stage, different signal-dependent beamformers can be used depending on the desired operating point in terms of noise reduction and speech distortion. The presented performance evaluation demonstrates the effectiveness of the proposed two-stage approach.
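The first stage described above is signal-independent. As a hedged illustration only (not the paper's actual filters), the sketch below shows the simplest beamformer of that kind, delay-and-sum, where the delays, array geometry, and signals are all made up for the example; averaging M time-aligned channels attenuates uncorrelated sensor noise power by roughly 1/M.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one desired signal reaching 4 microphones with
# known integer-sample delays, plus independent sensor noise.
n = 4000
delays = [0, 3, 5, 8]                             # assumed known propagation delays
s = np.sin(2 * np.pi * 5 * np.arange(n) / 1000)   # desired signal (periodic in n)
mics = [np.roll(s, d) + 0.5 * rng.standard_normal(n) for d in delays]

# Signal-independent (delay-and-sum) beamformer: undo each delay, then average.
aligned = np.stack([np.roll(x, -d) for x, d in zip(mics, delays)])
y = aligned.mean(axis=0)

noise_single = np.var(mics[0] - s)   # mic 0 has zero delay, so this is its noise
noise_beam = np.var(y - s)
print(noise_single, noise_beam)
```

With four microphones the residual noise power at the beamformer output is about a quarter of that at a single microphone, which is the spatial-diversity gain the two-stage approach builds on.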
Machine Learning Paradigms for Speech Recognition: An Overview
L. Deng, Xiao Li
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244083
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it remains largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insights from modern ML methodology show great promise to advance the state of the art in ASR technology. This article provides readers with an overview of modern ML techniques as utilized in current ASR research and systems, and as relevant to future ones. The intent is to foster greater cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either already popular in ASR or have the potential to make significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments in deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
A Graph-Partitioning Framework for Aligning Hierarchical Topic Structures to Presentations
Xiao-Dan Zhu, Colin Cherry, Gerald Penn
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244084
This paper studies the problem of imposing an existing hierarchical semantic structure onto a corresponding spoken document in which the structures are embedded, with the goal of indexing such documents for easier access. We propose a graph-partitioning framework to solve a semantic tree-to-string alignment problem through optimizing a normalized-cut criterion. We present models with different modeling capabilities and time complexities in this framework and provide experimental evidence of their performance. We relate graph partitioning to conventional dynamic time warping (DTW) as it applies to this problem, and show that the proposed framework can naturally include topic segmentation to accommodate cohesion constraints.
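The normalized-cut criterion mentioned above is commonly handled via a spectral relaxation. As a minimal sketch (a toy graph invented for illustration, not the paper's alignment graphs), thresholding the second eigenvector of the normalized graph Laplacian bipartitions the graph along a weak cut:

```python
import numpy as np

# Toy graph: two tight groups {0,1,2} and {3,4,5} joined by one weak edge.
A = np.zeros((6, 6))
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 0.1)]                       # the weak cross link
for i, j, w in edges:
    A[i, j] = A[j, i] = w

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(6) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

# Eigenvectors come back in ascending eigenvalue order; column 1 is the
# relaxed normalized-cut indicator (Fiedler direction).
vals, vecs = np.linalg.eigh(L_sym)
fiedler = D_inv_sqrt @ vecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)
```

The sign pattern of the relaxed indicator separates the two groups, cutting only the weight-0.1 edge, which is exactly what the normalized-cut objective favors.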
A New Variable Regularized QR Decomposition-Based Recursive Least M-Estimate Algorithm—Performance Analysis and Acoustic Applications
S. Chan, Y. Chu, Z. G. Zhang, K. Tsui
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2012.2236315
This paper proposes a new variable regularized QR decomposition (QRD)-based recursive least M-estimate (VR-QRRLM) adaptive filter and studies its convergence performance and acoustic applications. First, variable L2 regularization is introduced into an efficient QRD-based implementation of the conventional RLM algorithm to reduce its variance and improve its numerical stability. Difference equations describing the convergence behavior of this algorithm for Gaussian inputs and additive contaminated-Gaussian noise are derived, from which new expressions for the steady-state excess mean square error (EMSE) are obtained. They suggest that regularization helps to reduce the variance, especially when the input covariance matrix is ill-conditioned due to a lack of excitation, at the cost of a slightly increased bias. Moreover, the advantage of the M-estimation algorithm over its least squares counterpart is analytically quantified. For white Gaussian inputs, a new formula for selecting the regularization parameter is derived from the MSE analysis, which leads to the proposed VR-QRRLM algorithm. Its application to acoustic path identification and active noise control (ANC) problems is then studied, and a new filtered-x (FX) VR-QRRLM ANC algorithm is derived. The performance of this new ANC algorithm under impulsive noise and regularization can be characterized by the proposed theoretical analysis. Simulation results show that the VR-QRRLM-based algorithms considerably outperform traditional algorithms when the input signal level is low or in the presence of impulsive noise, and the theoretical predictions are in good agreement with the simulation results.
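The robustness advantage of M-estimation over least squares claimed above can be seen in the simplest possible setting. The sketch below is a hypothetical illustration only, using Huber-weighted location estimation rather than adaptive filtering: the bounded influence function caps what impulsive outliers can do to the estimate, whereas the plain mean is dragged off target.

```python
import numpy as np

rng = np.random.default_rng(4)

# 200 well-behaved samples around 1.0 plus three large impulses.
data = np.concatenate([rng.normal(1.0, 0.1, 200),
                       np.array([50.0, -40.0, 60.0])])

def huber_location(x, k=0.5, iters=50):
    """Iteratively reweighted Huber M-estimate of location (illustrative)."""
    mu = np.median(x)
    for _ in range(iters):
        r = x - mu
        # Huber weights: 1 inside the threshold, k/|r| outside (bounded influence).
        w = np.where(np.abs(r) <= k, 1.0, k / np.abs(r))
        mu = np.sum(w * x) / np.sum(w)
    return mu

ls = data.mean()            # least squares estimate, pulled away by the impulses
rob = huber_location(data)  # M-estimate, stays near the true value 1.0
print(ls, rob)
```

The same bounded-influence idea, applied to the recursive residuals of an adaptive filter, is what distinguishes the RLM family from ordinary recursive least squares.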
Binaural Integrated Active Noise Control and Noise Reduction in Hearing Aids
R. Serizel, M. Moonen, J. Wouters, S. H. Jensen
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2012.2234111
This paper presents a binaural approach to integrated active noise control and noise reduction in hearing aids and aims at demonstrating that a binaural setup indeed provides significant advantages in terms of the number of noise sources that can be compensated for and in terms of the causality margins.
Automatic Adaptation of the Time-Frequency Resolution for Sound Analysis and Re-Synthesis
M. Liuni, A. Röbel, E. Matusiak, M. Romito, X. Rodet
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239989
We present an algorithm for sound analysis and re-synthesis with local automatic adaptation of the time-frequency resolution. The reconstruction formula we propose is highly efficient and gives a good approximation of the original signal from analyses with different time-varying resolutions within complementary frequency bands. This is a typical case in which perfect reconstruction cannot, in general, be achieved with fast algorithms, leaving a reconstruction error to be minimized. We provide a theoretical upper bound on the reconstruction error of our method, and an example of automatic adaptive analysis and re-synthesis of a music sound.
Memory and Computation Trade-Offs for Efficient I-Vector Extraction
Sandro Cumani, P. Laface
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239291
This work aims at reducing the memory demand of the data structures that are usually pre-computed and stored for fast computation of i-vectors, a compact representation of spoken utterances used by most state-of-the-art speaker recognition systems. We propose two new approaches that allow accurate i-vector extraction while requiring less memory, and we show their relations with the standard computation method introduced for eigenvoices and with the recently proposed fast eigen-decomposition technique. The first approach computes an i-vector in a Variational Bayes (VB) framework by iteratively estimating one sub-block of i-vector elements at a time while keeping all the others fixed; it can obtain i-vectors as accurate as those obtained by the standard technique while requiring only 25% of its memory. The second technique is based on the Conjugate Gradient solution of a linear system, which is accurate and uses even less memory, but is slower than the VB approach. We analyze and compare the time and memory resources required by these solutions, which are suited to different applications, and we show that it is possible to obtain accurate results with a greatly reduced memory demand compared to the standard solution, at almost the same speed.
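The second technique above rests on the fact that conjugate gradient solves a symmetric positive-definite system using only matrix-vector products, so large pre-computed factorizations never need to be stored. A minimal sketch, with a stand-in system rather than the actual i-vector normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the i-vector posterior equations: a symmetric
# positive-definite system A @ w = b. CG only ever needs the action
# v -> A @ v, which is what saves memory relative to storing factorizations.
d = 50
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)   # SPD by construction
b = rng.standard_normal(d)

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Plain conjugate gradient for SPD systems, given only a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # A-conjugate direction update
        rs = rs_new
    return x

w = conjugate_gradient(lambda v: A @ v, b)
print(np.linalg.norm(A @ w - b))
```

In exact arithmetic CG terminates in at most d iterations; in practice it is stopped early at a residual tolerance, which is the speed-for-memory trade the abstract describes.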
Elimination of Impulsive Disturbances From Archive Audio Signals Using Bidirectional Processing
M. Niedźwiecki, M. Ciołek
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244090
In this application-oriented paper we consider the problem of eliminating impulsive disturbances, such as clicks, pops, and record scratches, from archive audio recordings. The proposed approach is based on bidirectional processing: noise pulses are localized by combining the results of forward-time and backward-time signal analysis. Based on the results of specially designed empirical tests (rather than on theoretical analysis), incorporating real audio files corrupted by real impulsive disturbances, we work out a set of local, case-dependent fusion rules that can be used to combine forward and backward detection alarms. This allows us to localize noise pulses more accurately and more reliably, yielding noticeable performance improvements compared to traditional methods based on unidirectional processing. The proposed approach is carefully validated using both artificially corrupted audio files and real archive gramophone recordings.
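The bidirectional idea above can be sketched in a deliberately simplified form. The paper's detectors are model-based and its fusion rules are case-dependent; here, as an assumption-laden toy, one-sample differences stand in for forward- and backward-time prediction residuals, and a single AND rule stands in for the fusion: each direction alarms on the click and on one neighboring sample, so intersecting the two directions pins down exactly the corrupted sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Smooth signal with light noise, corrupted by three impulsive clicks.
n = 2000
x = np.sin(2 * np.pi * 3 * np.arange(n) / n) + 0.01 * rng.standard_normal(n)
clicks = [300, 900, 1500]
x[clicks] += 4.0

# Forward-time and backward-time residual proxies (one-sample differences).
fwd = np.abs(np.diff(x, prepend=x[0]))               # |x[t] - x[t-1]|
bwd = np.abs(np.diff(x[::-1], prepend=x[-1]))[::-1]  # |x[t] - x[t+1]|
thr = 1.0

# Fusion rule (toy AND): only the click sample is anomalous in BOTH directions;
# its neighbors trigger one direction each and are correctly ignored.
fused = (fwd > thr) & (bwd > thr)
detected = np.flatnonzero(fused)

# Repair flagged samples by interpolating from their clean neighbors.
y = x.copy()
y[detected] = 0.5 * (x[detected - 1] + x[detected + 1])
print(detected)
```

A purely forward detector would flag the click and the sample after it; a purely backward one, the click and the sample before it. Fusing the two directions is what sharpens the localization, which is the effect the paper engineers with far more care.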
Single-Channel Speech-Music Separation for Robust ASR With Mixture Models
Cemil Demir, M. Saraçlar, A. Cemgil
Pub Date: 2013-04-01. DOI: 10.1109/TASL.2012.2231072
In this study, we describe a mixture-model-based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog, and the background music component is modeled by a scaled conditional mixture model representing that jingle. The speech signal is modeled by a probabilistic model similar to the probabilistic interpretation of the Non-negative Matrix Factorization (NMF) model. The parameters of the speech model are estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models, which correspond to the Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures, respectively. Our experiments show that the proposed mixture model outperforms a standard NMF method in both speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures that enforce temporal continuity between the jingle frames. Our test results with real data show that our method increases speech recognition performance.
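The standard NMF baseline the study compares against can be sketched compactly. This is a hedged illustration only: the spectrogram is random, both dictionaries are learned jointly and split arbitrarily between "speech" and "music" components, whereas in the paper the music side would come from the jingle catalog. The multiplicative updates below minimize the generalized KL divergence (the Poisson-model case), and a Wiener-style mask splits the mixture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonnegative "spectrogram": F frequency bins, T frames, K components.
F, T, K = 30, 40, 4
V = rng.random((F, T)) + 1e-3
W = rng.random((F, K)) + 1e-3   # spectral dictionary
H = rng.random((K, T)) + 1e-3   # activations
eps = 1e-12

def kl_div(V, WH):
    """Generalized KL divergence D(V || WH)."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

d0 = kl_div(V, W @ H)
for _ in range(100):
    # Standard Lee-Seung multiplicative updates for the KL cost.
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    WH = W @ H + eps
    W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
d1 = kl_div(V, W @ H)

# Wiener-style reconstruction: assign 2 components per source (arbitrary here).
WH = W @ H + eps
speech = V * (W[:, :2] @ H[:2]) / WH
music = V * (W[:, 2:] @ H[2:]) / WH
print(d0, d1)
```

The updates are guaranteed not to increase the KL cost, and the two masked estimates sum back to the mixture by construction, which is the property that makes masking a convenient separation front end for ASR.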