
Latest publications in IEEE Transactions on Audio Speech and Language Processing

Coding-Based Informed Source Separation: Nonnegative Tensor Factorization Approach
Pub Date: 2013-08-01 | DOI: 10.1109/TASL.2013.2260153
A. Ozerov, A. Liutkus, R. Badeau, G. Richard
Informed source separation (ISS) aims at reliably recovering sources from a mixture. To this end, it relies on the assumption that the original sources are available during an encoding stage. Given both the sources and the mixture, side-information may be computed and transmitted along with the mixture, after which the original sources are no longer available. During a decoding stage, both the mixture and the side-information are processed to recover the sources. ISS is motivated by a number of specific applications, including active listening and remixing of music, karaoke, and audio gaming. Most ISS techniques proposed so far rely on a source separation strategy and cannot achieve better results than oracle estimators. In this study, we introduce Coding-based ISS (CISS) and draw the connection between ISS and source coding. CISS amounts to encoding the sources using not only a model, as in source coding, but also the observation of the mixture. This strategy has several advantages over conventional ISS methods. First, it can reach any quality, provided sufficient bandwidth is available, as in source coding. Second, it makes use of the mixture to reduce the bitrate required to transmit the sources, as in classical ISS. Furthermore, we introduce Nonnegative Tensor Factorization as a very efficient model for CISS and report rate-distortion results that strongly outperform the state of the art.
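For intuition, the decoding stage of model-based ISS schemes is commonly a Wiener filtering step: the transmitted nonnegative spectrogram models are used to mask the mixture STFT. The sketch below illustrates only that generic step, not the paper's NTF-based coder; the function name and toy setup are ours, assuming numpy.

```python
import numpy as np

def wiener_decode(mix_stft, source_psds):
    """Recover source STFT estimates from a single-channel mixture by
    Wiener filtering, given per-source power-spectrogram models (in CISS,
    such models would be reconstructed from the transmitted side-information)."""
    total = np.maximum(sum(source_psds), 1e-12)  # mixture PSD model, guarded
    return [psd / total * mix_stft for psd in source_psds]

# Toy usage with oracle power spectrograms of two synthetic sources.
rng = np.random.default_rng(0)
s1 = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
s2 = 0.5 * (rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100)))
est1, est2 = wiener_decode(s1 + s2, [np.abs(s1) ** 2, np.abs(s2) ** 2])
```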
{"title":"Coding-Based Informed Source Separation: Nonnegative Tensor Factorization Approach","authors":"A. Ozerov, A. Liutkus, R. Badeau, G. Richard","doi":"10.1109/TASL.2013.2260153","DOIUrl":"https://doi.org/10.1109/TASL.2013.2260153","url":null,"abstract":"Informed source separation (ISS) aims at reliably recovering sources from a mixture. To this purpose, it relies on the assumption that the original sources are available during an encoding stage. Given both sources and mixture, a side-information may be computed and transmitted along with the mixture, whereas the original sources are not available any longer. During a decoding stage, both mixture and side-information are processed to recover the sources. ISS is motivated by a number of specific applications including active listening and remixing of music, karaoke, audio gaming, etc. Most ISS techniques proposed so far rely on a source separation strategy and cannot achieve better results than oracle estimators. In this study, we introduce Coding-based ISS (CISS) and draw the connection between ISS and source coding. CISS amounts to encode the sources using not only a model as in source coding but also the observation of the mixture. This strategy has several advantages over conventional ISS methods. First, it can reach any quality, provided sufficient bandwidth is available as in source coding. Second, it makes use of the mixture in order to reduce the bitrate required to transmit the sources, as in classical ISS. Furthermore, we introduce Nonnegative Tensor Factorization as a very efficient model for CISS and report rate-distortion results that strongly outperform the state of the art.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1699-1712"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2260153","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62889642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 46
Reverberation and Noise Robust Feature Compensation Based on IMM
Pub Date: 2013-08-01 | DOI: 10.1109/TASL.2013.2256893
C. Han, S. Kang, N. Kim
In this paper, we propose a novel feature compensation approach based on the interacting multiple model (IMM) algorithm, specially designed for joint processing of background noise and acoustic reverberation. Our approach to coping with the time-varying environmental parameters is to establish a switching linear dynamic model for the additive and convolutive distortions, such as background noise and acoustic reverberation, in the log-spectral domain. We construct multiple state space models of the speech corruption process in which the log spectra of clean speech and the log frequency response of the acoustic reverberation are jointly treated as the state of interest. The proposed approach shows significant improvements on the Aurora-5 automatic speech recognition (ASR) task, which was developed to investigate the performance of ASR with hands-free speech input in noisy room environments.
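For context, log-spectral feature compensation schemes of this kind typically build on the standard mismatch function relating the clean speech, channel, and noise log-power spectra; the snippet below states that relation only, as a hedged sketch of the modeling assumption rather than the paper's IMM recursion.

```python
import numpy as np

def corrupt_log_spectrum(x, h, n):
    """Standard log-spectral mismatch function: if power spectra add,
    then in the log domain y = x + h + log(1 + exp(n - x - h)), with
    clean speech x, channel/reverberation response h, and noise n,
    all in log-power. (A common modeling assumption, not necessarily
    the paper's exact formulation.)"""
    return x + h + np.logaddexp(0.0, n - x - h)
```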
{"title":"Reverberation and Noise Robust Feature Compensation Based on IMM","authors":"C. Han, S. Kang, N. Kim","doi":"10.1109/TASL.2013.2256893","DOIUrl":"https://doi.org/10.1109/TASL.2013.2256893","url":null,"abstract":"In this paper, we propose a novel feature compensation approach based on the interacting multiple model (IMM) algorithm specially designed for joint processing of background noise and acoustic reverberation. Our approach to cope with the time-varying environmental parameters is to establish a switching linear dynamic model for the additive and convolutive distortions, such as the background noise and acoustic reverberation, in the log-spectral domain. We construct multiple state space models with the speech corruption process in which the log spectra of clean speech and log frequency response of acoustic reverberation are jointly handled as the state of our interest. The proposed approach shows significant improvements in the Aurora-5 automatic speech recognition (ASR) task which was developed to investigate the influence on the performance of ASR for a hands-free speech input in noisy room environments.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1598-1611"},"PeriodicalIF":0.0,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2256893","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 11
MDCT Sinusoidal Analysis for Audio Signals Analysis and Processing
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2250963
Shuhua Zhang, W. Dou, Huazhong Yang
The Modified Discrete Cosine Transform (MDCT) is widely used in audio signal compression, but mostly limited to representing audio signals. This is because the MDCT is a real transform: phase information is missing, and spectral power varies from frame to frame even for pure sine waves. We make a key observation concerning the structure of the MDCT spectrum of a sine wave: across frames, the complete spectrum changes substantially, but when separated into even and odd subspectra, neither changes except by scaling. Inspired by this observation, we find that the MDCT spectrum of a sine wave can be represented as an envelope factor times a phase-modulation factor. The first is shift-invariant and depends only on the sine wave's amplitude and frequency, and thus stays constant over frames. The second has the form sinθ for all odd bins and cosθ for all even bins, so each subspectrum keeps a constant shape. Since θ depends on the start point of the transform frame, it changes at each new frame, and the whole spectrum changes with it. We apply this formulation of the MDCT spectral structure to frequency estimation in the MDCT domain, both for pure sine waves and for sine waves in noise. Compared to existing methods, ours are more accurate and more general (not limited to the sine window). We also apply the spectral structure to stereo coding. A pure tone or tone-dominant stereo signal may have very different left and right MDCT spectra, but their subspectra have similar shapes. One ratio for the even bins and one for the odd bins are enough to reconstruct the right channel from the left, saving half the bitrate. This scheme is simple and at the same time more efficient than traditional Intensity Stereo (IS).
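Since the MDCT itself is the core tool here, a minimal direct-matrix MDCT/IMDCT pair may help make the setup concrete; this is a generic textbook implementation (O(N²), sine window, 50% overlap), not code from the paper.

```python
import numpy as np

def mdct(frame):
    """Direct O(N^2) MDCT of a 2N-sample (already windowed) frame -> N bins."""
    n_half = len(frame) // 2
    n = np.arange(2 * n_half)
    k = np.arange(n_half)[:, None]
    return np.cos(np.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)) @ frame

def imdct(coeffs):
    """Inverse MDCT: N bins -> 2N time samples, before overlap-add."""
    n_half = len(coeffs)
    n = np.arange(2 * n_half)[:, None]
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
    return (2.0 / n_half) * (basis @ coeffs)

# Perfect-reconstruction check with the sine window and hop N.
N = 32
win = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # Princen-Bradley
x = np.random.default_rng(1).normal(size=8 * N)
y = np.zeros_like(x)
for s in range(0, len(x) - 2 * N + 1, N):
    y[s:s + 2 * N] += win * imdct(mdct(win * x[s:s + 2 * N]))
assert np.allclose(x[N:-N], y[N:-N])  # interior samples recovered exactly
```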
{"title":"MDCT Sinusoidal Analysis for Audio Signals Analysis and Processing","authors":"Shuhua Zhang, W. Dou, Huazhong Yang","doi":"10.1109/TASL.2013.2250963","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250963","url":null,"abstract":"The Modified Discrete Cosine Transform (MDCT) is widely used in audio signals compression, but mostly limited to representing audio signals. This is because the MDCT is a real transform: Phase information is missing and spectral power varies frame to frame even for pure sine waves. We have a key observation concerning the structure of the MDCT spectrum of a sine wave: Across frames, the complete spectrum changes substantially, but if separated into even and odd subspectra, neither changes except scaling. Inspired by this observation, we find that the MDCT spectrum of a sine wave can be represented as an envelope factor times a phase-modulation factor. The first one is shift-invariant and depends only on the sine wave's amplitude and frequency, thus stays constant over frames. The second one has the form of sinθ for all odd bins and cosθ for all even bins, leading to subspectra's constant shapes. But this θ depends on the start point of a transform frame, therefore, changes at each new frame, and then changes the whole spectrum. We apply this formulation of the MDCT spectral structure to frequency estimation in the MDCT domain, both for pure sine waves and sine waves with noises. Compared to existing methods, ours are more accurate and more general (not limited to the sine window). We also apply the spectral structure to stereo coding. A pure tone or tone-dominant stereo signal may have very different left and right MDCT spectra, but their subspectra have similar shapes. One ratio for even bins and one ratio for odd bins will be enough to reconstruct the right from the left, saving half bitrate. This scheme is simple and at the same time more efficient than the traditional Intensity Stereo (IS).","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1403-1414"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250963","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 20
A Symmetric Kernel Partial Least Squares Framework for Speaker Recognition
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2253096
Balaji Vasan Srinivasan, Yuancheng Luo, D. Garcia-Romero, D. Zotkin, R. Duraiswami
I-vectors are concise representations of speaker characteristics. Recent progress in i-vector research has exploited their ability to capture speaker and channel variability to develop efficient automatic speaker verification (ASV) systems. Inter-speaker relationships in the i-vector space are non-linear, so effective speaker verification requires good modeling of these non-linearities and can be cast as a machine learning problem. Kernel partial least squares (KPLS) can be used for discriminative training in the i-vector space. However, this framework suffers from training data imbalance and asymmetric scoring. We use "one shot similarity scoring" (OSS) to address both issues. The resulting ASV system (OSS-KPLS) is tested across several conditions of the NIST SRE 2010 extended core data set and compared against state-of-the-art systems: Joint Factor Analysis (JFA), Probabilistic Linear Discriminant Analysis (PLDA), and Cosine Distance Scoring (CDS) classifiers. Improvements are shown.
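Of the baselines listed, cosine distance scoring is simple enough to state in a few lines; the sketch below shows that baseline only (not the OSS-KPLS system), with illustrative names.

```python
import numpy as np

def cds_score(w_enroll, w_test):
    """Cosine distance scoring between two i-vectors: the cosine of the
    angle between length-normalized vectors; higher suggests same speaker."""
    w1 = w_enroll / np.linalg.norm(w_enroll)
    w2 = w_test / np.linalg.norm(w_test)
    return float(w1 @ w2)

# Toy usage with random 400-dimensional stand-ins for i-vectors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=400), rng.normal(size=400)
print(cds_score(a, a + 0.1 * b))  # close to 1.0 (same "speaker")
print(cds_score(a, b))            # close to 0.0 (unrelated vectors)
```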
{"title":"A Symmetric Kernel Partial Least Squares Framework for Speaker Recognition","authors":"Balaji Vasan Srinivasan, Yuancheng Luo, D. Garcia-Romero, D. Zotkin, R. Duraiswami","doi":"10.1109/TASL.2013.2253096","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253096","url":null,"abstract":"I-vectors are concise representations of speaker characteristics. Recent progress in i-vectors related research has utilized their ability to capture speaker and channel variability to develop efficient automatic speaker verification (ASV) systems. Inter-speaker relationships in the i-vector space are non-linear. Accomplishing effective speaker verification requires a good modeling of these non-linearities and can be cast as a machine learning problem. Kernel partial least squares (KPLS) can be used for discriminative training in the i-vector space. However, this framework suffers from training data imbalance and asymmetric scoring. We use “one shot similarity scoring” (OSS) to address this. The resulting ASV system (OSS-KPLS) is tested across several conditions of the NIST SRE 2010 extended core data set and compared against state-of-the-art systems: Joint Factor Analysis (JFA), Probabilistic Linear Discriminant Analysis (PLDA), and Cosine Distance Scoring (CDS) classifiers. Improvements are shown.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1415-1423"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253096","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 13
Model-Based Inversion of Dynamic Range Compression
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2253099
Stanislaw Gorlow, J. Reiss
In this work it is shown how a dynamic nonlinear time-variant operator, such as a dynamic range compressor, can be inverted using an explicit signal model. Knowing the model parameters that were used for compression, one can recover the original uncompressed signal from a "broadcast" signal with high numerical accuracy and very low computational complexity. A compressor-decompressor scheme is worked out and described in detail. The approach is evaluated on real-world audio material with great success.
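The paper treats the full dynamic, time-variant case; as a hedged illustration of the underlying idea (a compressor with known parameters is invertible), the sketch below inverts only the memoryless special case, a static dB-domain gain curve.

```python
import numpy as np

def compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static compression curve: unity slope below threshold, slope
    1/ratio above it (all quantities in dB)."""
    over = np.maximum(level_db - threshold_db, 0.0)
    return level_db - over * (1.0 - 1.0 / ratio)

def invert_db(comp_db, threshold_db=-20.0, ratio=4.0):
    """Closed-form inverse of compress_db; it exists because the curve is
    strictly increasing. Above threshold, the dB overshoot is scaled back
    up by ratio."""
    over = np.maximum(comp_db - threshold_db, 0.0)
    return comp_db + over * (ratio - 1.0)

levels = np.linspace(-60.0, 0.0, 13)
assert np.allclose(invert_db(compress_db(levels)), levels)
```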
{"title":"Model-Based Inversion of Dynamic Range Compression","authors":"Stanislaw Gorlow, J. Reiss","doi":"10.1109/TASL.2013.2253099","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253099","url":null,"abstract":"In this work it is shown how a dynamic nonlinear time-variant operator, such as a dynamic range compressor, can be inverted using an explicit signal model. By knowing the model parameters that were used for compression one is able to recover the original uncompressed signal from a “broadcast” signal with high numerical accuracy and very low computational complexity. A compressor-decompressor scheme is worked out and described in detail. The approach is evaluated on real-world audio material with great success.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"32 1","pages":"1434-1444"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253099","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 24
CLOSE—A Data-Driven Approach to Speech Separation
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2250959
J. Ming, R. Srinivasan, D. Crookes, Ayeh Jafari
This paper studies single-channel speech separation, assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A data-driven approach is described, which matches each mixed speech segment against a composite training segment in order to separate the underlying clean speech segments. To improve separation accuracy, the new approach seeks and separates the longest mixed speech segments with matching composite training segments. Lengthening the mixed speech segments to be matched reduces the uncertainty of the constituent training segments, and hence the separation error. For convenience, we call the new approach Composition of Longest Segments, or CLOSE. The CLOSE method includes a data-driven approach to model the long-range temporal dynamics of speech signals, and a statistical approach to identify the longest mixed speech segments with matching composite training segments. Experiments are conducted on the Wall Street Journal database, separating mixtures of two simultaneous large-vocabulary utterances spoken by two different speakers. The results are evaluated using various objective and subjective measures, including the challenging task of large-vocabulary continuous speech recognition. It is shown that the new separation approach leads to significant improvement in all these measures.
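A toy version of the composite-matching idea: exhaustively compare a mixed segment against sums of one training segment per speaker and keep the best pair. This assumes additive feature vectors and fixed-length segments, which glosses over the paper's variable-length search and statistical matching criteria.

```python
import numpy as np

def best_composite(mix_seg, train_a, train_b):
    """Return indices (i, j) minimizing ||mix - (train_a[i] + train_b[j])||^2,
    i.e., the best-matching composite training segment for the mixture."""
    best, best_pair = np.inf, None
    for i, a in enumerate(train_a):
        for j, b in enumerate(train_b):
            d = np.sum((mix_seg - (a + b)) ** 2)
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair

# Toy usage: the mixture really is training segment 3 plus training segment 7.
rng = np.random.default_rng(2)
A = rng.normal(size=(10, 64))  # speaker-1 training segments (toy features)
B = rng.normal(size=(10, 64))  # speaker-2 training segments
mix = A[3] + B[7] + 0.01 * rng.normal(size=64)
print(best_composite(mix, A, B))  # -> (3, 7)
```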
{"title":"CLOSE—A Data-Driven Approach to Speech Separation","authors":"J. Ming, R. Srinivasan, D. Crookes, Ayeh Jafari","doi":"10.1109/TASL.2013.2250959","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250959","url":null,"abstract":"This paper studies single-channel speech separation, assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A data-driven approach is described, which matches each mixed speech segment against a composite training segment to separate the underlying clean speech segments. To advance the separation accuracy, the new approach seeks and separates the longest mixed speech segments with matching composite training segments. Lengthening the mixed speech segments to match reduces the uncertainty of the constituent training segments, and hence the error of separation. For convenience, we call the new approach Composition of Longest Segments, or CLOSE. The CLOSE method includes a data-driven approach to model long-range temporal dynamics of speech signals, and a statistical approach to identify the longest mixed speech segments with matching composite training segments. Experiments are conducted on the Wall Street Journal database, for separating mixtures of two simultaneous large-vocabulary speech utterances spoken by two different speakers. The results are evaluated using various objective and subjective measures, including the challenge of large-vocabulary continuous speech recognition. It is shown that the new separation approach leads to significant improvement in all these measures.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1355-1368"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250959","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 25
A Perceptual Study on Velvet Noise and Its Variants at Different Pulse Densities
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2255281
V. Välimäki, Heidi-Maria Lehtonen, M. Takanen
This paper investigates sparse noise sequences, including the previously proposed velvet noise and the novel variants defined here. All sequences consist of sample values −1, 0, and +1 only, and the location and sign of each impulse are chosen at random. Two of the proposed algorithms are direct variants of the original velvet noise, requiring two random number sequences to determine the impulse locations and signs. In one of the proposed algorithms, the impulse locations and signs are drawn from the same random number sequence, which is advantageous in terms of implementation. Moreover, two of the new sequences include known regions of zeros. The perceived smoothness of the proposed sequences was studied in a listening test in which subjects compared the noise sequences against a Gaussian white noise reference. The results show that the original velvet noise sounds smoother than the reference at 2000 impulses per second. At 4000 impulses per second, three of the proposed algorithms are also perceived as smoother than the Gaussian noise sequence. These observations can be exploited in the synthesis of noisy sounds and in artificial reverberation.
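For reference, the original velvet noise (the paper's baseline) is easy to generate: one ±1 impulse at a random position inside each grid cell, zeros elsewhere. A minimal sketch assuming the standard OVN definition; the variants studied in the paper differ in how the locations and signs are drawn.

```python
import numpy as np

def velvet_noise(duration_s, fs=44100, density=2000, seed=None):
    """Original velvet noise: the signal is zero except for one impulse of
    random sign (+1 or -1) at a random offset within each grid cell of
    fs/density samples; locations and signs use two random sequences."""
    rng = np.random.default_rng(seed)
    grid = fs / density                       # average impulse spacing
    n_pulses = int(duration_s * density)
    out = np.zeros(int(duration_s * fs))
    offsets = rng.integers(0, int(grid), size=n_pulses)  # location sequence
    signs = rng.choice([-1.0, 1.0], size=n_pulses)       # sign sequence
    idx = (np.arange(n_pulses) * grid).astype(int) + offsets
    keep = idx < len(out)
    out[idx[keep]] = signs[keep]
    return out

noise = velvet_noise(1.0, density=2000, seed=3)  # ~2000 impulses per second
```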
{"title":"A Perceptual Study on Velvet Noise and Its Variants at Different Pulse Densities","authors":"V. Välimäki, Heidi-Maria Lehtonen, M. Takanen","doi":"10.1109/TASL.2013.2255281","DOIUrl":"https://doi.org/10.1109/TASL.2013.2255281","url":null,"abstract":"This paper investigates sparse noise sequences, including the previously proposed velvet noise and its novel variants defined here. All sequences consist of sample values minus one, zero, and plus one only, and the location and the sign of each impulse is randomly chosen. Two of the proposed algorithms are direct variants of the original velvet noise requiring two random number sequences for determining the impulse locations and signs. In one of the proposed algorithms the impulse locations and signs are drawn from the same random number sequence, which is advantageous in terms of implementation. Moreover, two of the new sequences include known regions of zeros. The perceived smoothness of the proposed sequences was studied with a listening test in which test subjects compared the noise sequences against a reference signal that was a Gaussian white noise. The results show that the original velvet noise sounds smoother than the reference at 2000 impulses per second. At 4000 impulses per second, also three of the proposed algorithms are perceived smoother than the Gaussian noise sequence. These observations can be exploited in the synthesis of noisy sounds and in artificial reverberation.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1481-1488"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2255281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 33
Stochastic-Deterministic MMSE STFT Speech Enhancement With General A Priori Information
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2253100
Matthew C. McCallum, B. Guillemin
A wide range of Bayesian short-time spectral amplitude (STSA) speech enhancement algorithms exist, varying in both the statistical model used for speech and the cost functions considered. Current algorithms of this class consistently assume that clean speech short-time Fourier transform (STFT) samples are either randomly distributed with zero mean or deterministic. No single distribution function has been considered that captures both deterministic and random signal components. In this paper, a Bayesian STSA algorithm is proposed under a stochastic-deterministic (SD) speech model that makes provision for the inclusion of a priori information by considering a non-zero mean. Analytical expressions are derived for the speech STFT magnitude in the MMSE sense and for the phase in the maximum-likelihood sense. Furthermore, a practical method of estimating the a priori SD speech model parameters is described, based on explicit consideration of harmonically related sinusoidal components in each STFT frame and of the variations in both the magnitude and phase of these components between successive STFT frames. Objective tests using the PESQ measure indicate that the proposed algorithm yields superior speech quality compared to several other speech enhancement algorithms. In particular, the proposed algorithm has an improved capability to retain low-amplitude voiced speech components in low SNR conditions.
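For orientation, the conventional zero-mean member of this algorithm family combines a decision-directed a priori SNR estimate with a Wiener-type gain; the sketch below shows that baseline per frame, not the paper's non-zero-mean SD estimator.

```python
import numpy as np

def wiener_gain_dd(noisy_pow, noise_pow, prev_clean_pow, alpha=0.98):
    """Per-frame spectral gain for a classical zero-mean enhancer:
    decision-directed a priori SNR followed by the Wiener gain
    G = xi / (1 + xi). Inputs are per-bin power spectra."""
    noise_pow = np.maximum(noise_pow, 1e-12)
    post_snr = noisy_pow / noise_pow
    prio_snr = (alpha * prev_clean_pow / noise_pow
                + (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0))
    return prio_snr / (1.0 + prio_snr)
```

Applying the returned gain to the noisy STFT frame and feeding the resulting clean-power estimate back in as prev_clean_pow for the next frame completes the recursion.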
{"title":"Stochastic-Deterministic MMSE STFT Speech Enhancement With General A Priori Information","authors":"Matthew C. McCallum, B. Guillemin","doi":"10.1109/TASL.2013.2253100","DOIUrl":"https://doi.org/10.1109/TASL.2013.2253100","url":null,"abstract":"A wide range of Bayesian short-time spectral amplitude (STSA) speech enhancement algorithms exist, varying in both the statistical model used for speech and the cost functions considered. Current algorithms of this class consistently assume that the distribution of clean speech short time Fourier transform (STFT) samples are either randomly distributed with zero mean or deterministic. No single distribution function has been considered that captures both deterministic and random signal components. In this paper a Bayesian STSA algorithm is proposed under a stochastic-deterministic (SD) speech model that makes provision for the inclusion of a priori information by considering a non-zero mean. Analytical expressions are derived for the speech STFT magnitude in the MMSE sense, and phase in the maximum-likelihood sense. Furthermore, a practical method of estimating the a priori SD speech model parameters is described based on explicit consideration of harmonically related sinusoidal components in each STFT frame, and variations in both the magnitude and phase of these components between successive STFT frames. Objective tests using the PESQ measure indicate that the proposed algorithm results in superior speech quality when compared to several other speech enhancement algorithms. In particular it is clear that the proposed algorithm has an improved capability to retain low amplitude voiced speech components in low SNR conditions.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1445-1457"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2253100","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 31
Towards Scaling Up Classification-Based Speech Separation
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2250961
Yuxuan Wang, Deliang Wang
Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective way to cope with the mismatch is to include many different acoustic conditions in the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and to train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.
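The classification target behind this line of work is the ideal binary mask (IBM): a time-frequency unit is labeled 1 when its local SNR exceeds a local criterion (LC). A hedged sketch using scipy's STFT; the parameter choices (16 kHz, 20 ms frames, LC = 0 dB) are illustrative, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(speech, noise, fs=16000, lc_db=0.0):
    """Ideal binary mask: 1 where the local SNR exceeds the local
    criterion. Classification-based separation trains a classifier to
    predict these labels from features of the noisy mixture."""
    _, _, S = stft(speech, fs=fs, nperseg=320)
    _, _, N = stft(noise, fs=fs, nperseg=320)
    snr_db = 20.0 * np.log10(np.abs(S) / np.maximum(np.abs(N), 1e-12))
    return (snr_db > lc_db).astype(float)

def apply_mask(mixture, mask, fs=16000):
    """Resynthesize by masking the mixture STFT and inverting."""
    _, _, M = stft(mixture, fs=fs, nperseg=320)
    _, out = istft(M * mask, fs=fs, nperseg=320)
    return out
```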
{"title":"Towards Scaling Up Classification-Based Speech Separation","authors":"Yuxuan Wang, Deliang Wang","doi":"10.1109/TASL.2013.2250961","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250961","url":null,"abstract":"Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1381-1390"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250961","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 435
Sparse Reverberant Audio Source Separation via Reweighted Analysis
Pub Date: 2013-07-01 | DOI: 10.1109/TASL.2013.2250962
S. Arberet, P. Vandergheynst, R. Carrillo, J. Thiran, Y. Wiaux
We propose a novel algorithm for estimating source signals from an underdetermined convolutive mixture, assuming known mixing filters. Most state-of-the-art methods deal with anechoic or weakly reverberant mixtures, assuming a synthesis sparse prior in the time-frequency domain and a narrowband approximation of the convolutive mixing process. In this paper, we address the source estimation of convolutive mixtures with a new algorithm based on (i) an analysis sparse prior, (ii) a reweighting scheme that increases sparsity, and (iii) a wideband data-fidelity term in constrained form. We show, through theoretical discussion and simulations, that this algorithm is particularly well suited to source separation of realistic reverberant mixtures. In particular, the proposed algorithm outperforms state-of-the-art methods on reverberant mixtures of audio sources by more than 2 dB of signal-to-distortion ratio on the BSS Oracle dataset.
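The reweighting idea itself can be shown on a toy synthesis-sparse problem: alternately solve a weighted LASSO and reset the weights to 1/(|x_i| + eps). A minimal sketch with ISTA; the paper's actual algorithm uses an analysis prior and a wideband constrained data-fidelity term, which this toy omits.

```python
import numpy as np

def reweighted_l1(A, y, n_outer=5, n_inner=200, lam=0.05, eps=1e-3):
    """Toy reweighted-l1 recovery of a sparse x from y = A @ x: ISTA for a
    weighted LASSO in the inner loop, weight update w = 1/(|x| + eps) in
    the outer loop so strong coefficients are penalized less."""
    x = np.zeros(A.shape[1])
    w = np.ones(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant
    for _ in range(n_outer):
        for _ in range(n_inner):
            z = x - step * (A.T @ (A @ x - y))
            x = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)
        w = 1.0 / (np.abs(x) + eps)  # reweighting step
    return x

# Toy usage: recover a 5-sparse vector from 40 random measurements.
rng = np.random.default_rng(4)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = 3.0 * rng.normal(size=5)
x_hat = reweighted_l1(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))  # typically small in this regime
```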
{"title":"Sparse Reverberant Audio Source Separation via Reweighted Analysis","authors":"S. Arberet, P. Vandergheynst, R. Carrillo, J. Thiran, Y. Wiaux","doi":"10.1109/TASL.2013.2250962","DOIUrl":"https://doi.org/10.1109/TASL.2013.2250962","url":null,"abstract":"We propose a novel algorithm for source signals estimation from an underdetermined convolutive mixture assuming known mixing filters. Most of the state-of-the-art methods are dealing with anechoic or short reverberant mixture, assuming a synthesis sparse prior in the time-frequency domain and a narrowband approximation of the convolutive mixing process. In this paper, we address the source estimation of convolutive mixtures with a new algorithm based on i) an analysis sparse prior, ii) a reweighting scheme so as to increase the sparsity, iii) a wideband data-fidelity term in a constrained form. We show, through theoretical discussions and simulations, that this algorithm is particularly well suited for source separation of realistic reverberation mixtures. Particularly, the proposed algorithm outperforms state-of-the-art methods on reverberant mixtures of audio sources by more than 2 dB of signal-to-distortion ratio on the BSS Oracle dataset.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1391-1402"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2250962","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62888684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 21