
Latest Publications: IEEE Trans. Speech Audio Process.

Blind single channel deconvolution using nonstationary signal processing
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815522
J. Hopgood, P. Rayner
Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The channel parameters are estimated by modeling the source as a time-varying AR process and the distortion as an all-pole filter, and by using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, which merely relies on the fact that the channel is actually stationary rather than modeling it as such, the proposed Bayesian method does account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it a curse is discussed.
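The signal model here is concrete enough to illustrate. Below is a minimal Python sketch, not the authors' Bayesian estimator: it simulates a time-varying AR source passed through a stationary all-pole channel, then recovers the channel pole with a brute-force likelihood grid search; the drift function, block size, and all signal parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N = 4000

# Nonstationary source: AR(1) whose coefficient drifts slowly over time.
a_t = 0.3 + 0.5 * np.sin(2 * np.pi * np.arange(N) / N)  # illustrative drift
s = np.zeros(N)
e = rng.standard_normal(N)
for n in range(1, N):
    s[n] = a_t[n] * s[n - 1] + e[n]

# Stationary all-pole channel with a single real pole.
true_pole = 0.8
x = lfilter([1.0], [1.0, -true_pole], s)

def neg_log_lik(pole, x, block=200):
    """Invert a candidate channel, fit local AR(1) models on short blocks,
    and score the residual variance (negative log-likelihood up to constants)."""
    s_hat = lfilter([1.0, -pole], [1.0], x)  # FIR inverse of the all-pole channel
    nll = 0.0
    for i in range(0, len(s_hat) - block, block):
        seg = s_hat[i:i + block]
        a = seg[1:] @ seg[:-1] / (seg[:-1] @ seg[:-1])  # local AR(1) fit
        r = seg[1:] - a * seg[:-1]
        nll += 0.5 * len(r) * np.log(np.mean(r ** 2))
    return nll

grid = np.linspace(-0.95, 0.95, 191)
est = grid[np.argmin([neg_log_lik(p, x) for p in grid])]
print(f"true pole {true_pole:.2f}, estimated {est:.2f}")
```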
{"title":"Blind single channel deconvolution using nonstationary signal processing","authors":"J. Hopgood, P. Rayner","doi":"10.1109/TSA.2003.815522","DOIUrl":"https://doi.org/10.1109/TSA.2003.815522","url":null,"abstract":"Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The parameters for the channel are estimated by modeling the source as a time-varyng AR process and the distortion by an all-pole filter, and using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, where the technique merely relies on the fact that the channel is actually stationary rather than modeling it as so, the proposed Bayesian method does take account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it as a curse is discussed.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"469 1","pages":"476-488"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77508201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
A new approach to utterance verification based on neighborhood information in model space
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815821
Hui Jiang, Chin-Hui Lee
We propose to use neighborhood information in model space to perform utterance verification (UV). First, we present a nested-neighborhood structure for each underlying model in model space and assume that the underlying model's competing models sit in one of these neighborhoods, which is used to model the alternative hypothesis in UV. Bayes factors (BF) are introduced to UV and used as the major tool to calculate confidence measures based on the above idea. Experimental results on the Bell Labs communicator system show that the new method dramatically improves verification performance when verifying correctly recognized words against misrecognized words in the recognizer's output, with a relative reduction of more than 20% in equal error rate (EER) compared with the standard approach based on likelihood ratio testing and anti-models.
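The confidence measure rests on a log Bayes factor between the target model and a neighborhood of competing models. A minimal sketch of that scoring step, assuming log-likelihoods are already available from the recognizer and assuming a uniform prior over the neighborhood models (both assumptions, not details taken from the paper):

```python
import numpy as np

def bayes_factor_confidence(ll_target, ll_neighbors):
    """Log Bayes factor: target log-likelihood minus the log of the average
    likelihood of the competing models in the neighborhood (uniform prior)."""
    ll_neighbors = np.asarray(ll_neighbors, dtype=float)
    log_alt = np.logaddexp.reduce(ll_neighbors) - np.log(len(ll_neighbors))
    return ll_target - log_alt

# Hypothetical log-likelihoods; accept the hypothesis if the factor
# clears a threshold tuned on development data.
conf = bayes_factor_confidence(-120.5, [-130.2, -128.7, -135.1])
accept = conf > 0.0
```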
{"title":"A new approach to utterance verification based on neighborhood information in model space","authors":"Hui Jiang, Chin-Hui Lee","doi":"10.1109/TSA.2003.815821","DOIUrl":"https://doi.org/10.1109/TSA.2003.815821","url":null,"abstract":"We propose to use neighborhood information in model space to perform utterance verification (UV). At first, we present a nested-neighborhood structure for each underlying model in model space and assume the underlying model's competing models sit in one of these neighborhoods, which is used to model alternative hypothesis in UV. Bayes factors (BF) is first introduced to UV and used as a major tool to calculate confidence measures based on the above idea. Experimental results in the Bell Labs communicator system show that the new method has dramatically improved verification performance when verifying correct words against mis-recognized words in the recognizer's output, relatively more than 20% reduction in equal error rate (EER) when comparing with the standard approach based on likelihood ratio testing and anti-models.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"6 1","pages":"425-434"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87640258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
A perceptually motivated approach for speech enhancement
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815936
Y. Hu, P. Loizou
A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrated significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.
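The key idea, reducing the audibility of residual noise via frequency masking, can be caricatured in a few lines. The sketch below uses a crude smoothed-spectrum proxy for the masking threshold rather than a real psychoacoustic model, and a plain spectral-subtraction gain rather than the paper's frequency-domain and subspace methods; `noise_psd`, the offset, and the gain floor are all assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs, noise_psd, offset_db=-10.0, gain_floor=0.3):
    """noisy: 1-D signal; noise_psd: per-bin noise power, shape (257,)
    for nperseg=512. Offset and floor are illustrative tuning values."""
    f, t, X = stft(noisy, fs, nperseg=512)
    P = np.abs(X) ** 2
    S_hat = np.maximum(P - noise_psd[:, None], 1e-10)  # rough clean-speech PSD

    # Masking-threshold proxy: speech PSD smoothed across frequency,
    # lowered by a fixed offset (a stand-in for a psychoacoustic model).
    kernel = np.hanning(9)
    kernel /= kernel.sum()
    T = np.apply_along_axis(lambda col: np.convolve(col, kernel, 'same'), 0, S_hat)
    T *= 10.0 ** (offset_db / 10.0)

    gain = np.sqrt(S_hat / P)  # spectral-subtraction gain
    # Where the residual noise would already be masked, relax the gain
    # toward a floor instead of suppressing further.
    masked = noise_psd[:, None] * gain ** 2 < T
    gain = np.where(masked, np.maximum(gain, gain_floor), gain)
    _, y = istft(gain * X, fs, nperseg=512)
    return y
```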
{"title":"A perceptually motivated approach for speech enhancement","authors":"Y. Hu, P. Loizou","doi":"10.1109/TSA.2003.815936","DOIUrl":"https://doi.org/10.1109/TSA.2003.815936","url":null,"abstract":"A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrated significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"171 1","pages":"457-465"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79442675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 99
Audio source separation of convolutive mixtures
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815820
N. Mitianoudis, M. Davies
The problem of separating audio sources recorded in a real-world situation is well established in modern literature. One method to solve this problem is blind source separation (BSS) using independent component analysis (ICA). The recording environment is usually modeled as convolutive. Previous research on ICA of instantaneous mixtures provides a solid background for the separation of convolved mixtures. The authors review current approaches on the subject and propose a fast frequency-domain ICA framework, providing a solution for the apparent permutation problem encountered in these methods.
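The permutation problem arises because per-bin ICA returns the sources in an arbitrary order in each frequency bin. One standard fix, not necessarily the authors' exact scheme, is to align the bins by correlating amplitude envelopes against a running reference; a minimal sketch:

```python
import numpy as np
from itertools import permutations

def align_permutations(Y):
    """Y: complex array (bins, sources, frames) of per-bin ICA outputs.
    Reorders each bin so amplitude envelopes correlate with a running
    centroid of the already-aligned bins."""
    K, M, T = Y.shape
    env = np.abs(Y)
    centroid = env[0].copy()
    for k in range(1, K):
        best_score, best_perm = -np.inf, None
        # Exhaustive over M! orderings; fine for the 2-3 source case.
        for perm in permutations(range(M)):
            score = sum(np.corrcoef(centroid[m], env[k, perm[m]])[0, 1]
                        for m in range(M))
            if score > best_score:
                best_score, best_perm = score, perm
        order = list(best_perm)
        Y[k], env[k] = Y[k, order], env[k, order]
        centroid = 0.9 * centroid + 0.1 * env[k]  # slowly updated reference
    return Y
```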
{"title":"Audio source separation of convolutive mixtures","authors":"N. Mitianoudis, M. Davies","doi":"10.1109/TSA.2003.815820","DOIUrl":"https://doi.org/10.1109/TSA.2003.815820","url":null,"abstract":"The problem of separation of audio sources recorded in a real world situation is well established in modern literature. A method to solve this problem is blind source separation (BSS) using independent component analysis (ICA). The recording environment is usually modeled as convolutive. Previous research on ICA of instantaneous mixtures provided solid background for the separation of convolved mixtures. The authors revise current approaches on the subject and propose a fast frequency domain ICA framework, providing a solution for the apparent permutation problem encountered in these methods.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"59 5","pages":"489-497"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91432369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 153
Fast model selection based speaker adaptation for nonnative speech
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814379
Xiaodong He, Yunxin Zhao
The problem of adapting acoustic models of native English speech to nonnative speakers is addressed from a perspective of adaptive model complexity selection. The goal is to select model complexity dynamically for each nonnative talker so as to optimize the balance between model robustness to pronunciation variations and model detailedness for discrimination of speech sounds. A maximum expected likelihood (MEL) based technique is proposed to enable reliable complexity selection when adaptation data are sparse, where expectation of log-likelihood (EL) of adaptation data is computed based on distributions of mismatch biases between model and data, and model complexity is selected to maximize EL. The MEL based complexity selection is further combined with MLLR (maximum likelihood linear regression) to enable adaptation of both complexity and parameters of acoustic models. Experiments were performed on WSJ1 data of speakers with a wide range of foreign accents. Results show that the MEL based complexity selection is feasible when using as little as one adaptation utterance, and it is able to select dynamically the proper model complexity as the adaptation data increases. Compared with the standard MLLR, the MEL+MLLR method leads to consistent and significant improvement to recognition accuracy on nonnative speakers, without performance degradation on native speakers.
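The core of MEL-style selection, choosing the complexity that maximizes an estimate of expected likelihood when adaptation data are sparse, can be approximated with cross-validated held-out likelihood. The sketch below substitutes a Gaussian mixture for the acoustic model and cross-validation for the paper's expected-likelihood computation over mismatch-bias distributions; both substitutions are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def select_complexity(adapt_data, candidates=(1, 2, 4, 8), seed=0):
    """adapt_data: (n_frames, n_features). Returns the candidate complexity
    (here: mixture size) with the best cross-validated log-likelihood."""
    best_score, best_k = -np.inf, candidates[0]
    kf = KFold(n_splits=3, shuffle=True, random_state=seed)
    for k in candidates:
        scores = []
        for tr, te in kf.split(adapt_data):
            gm = GaussianMixture(n_components=k, covariance_type='diag',
                                 random_state=seed).fit(adapt_data[tr])
            scores.append(gm.score(adapt_data[te]))  # mean held-out log-lik
        if np.mean(scores) > best_score:
            best_score, best_k = np.mean(scores), k
    return best_k
```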
{"title":"Fast model selection based speaker adaptation for nonnative speech","authors":"Xiaodong He, Yunxin Zhao","doi":"10.1109/TSA.2003.814379","DOIUrl":"https://doi.org/10.1109/TSA.2003.814379","url":null,"abstract":"The problem of adapting acoustic models of native English speech to nonnative speakers is addressed from a perspective of adaptive model complexity selection. The goal is to select model complexity dynamically for each nonnative talker so as to optimize the balance between model robustness to pronunciation variations and model detailedness for discrimination of speech sounds. A maximum expected likelihood (MEL) based technique is proposed to enable reliable complexity selection when adaptation data are sparse, where expectation of log-likelihood (EL) of adaptation data is computed based on distributions of mismatch biases between model and data, and model complexity is selected to maximize EL. The MEL based complexity selection is further combined with MLLR (maximum likelihood linear regression) to enable adaptation of both complexity and parameters of acoustic models. Experiments were performed on WSJ1 data of speakers with a wide range of foreign accents. Results show that the MEL based complexity selection is feasible when using as little as one adaptation utterance, and it is able to select dynamically the proper model complexity as the adaptation data increases. Compared with the standard MLLR, the MEL+MLLR method leads to consistent and significant improvement to recognition accuracy on nonnative speakers, without performance degradation on native speakers.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"17 1 1","pages":"298-307"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82933228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
A new duration modeling approach for Mandarin speech
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814377
Sin-Horng Chen, Wen-Hsing Lai, Yih-Ru Wang
A new duration modeling approach for Mandarin speech is proposed. It explicitly takes several major affecting factors into account as multiplicative companding factors (CFs) and estimates all model parameters by an EM algorithm. The three basic Tone 3 patterns (i.e., full tone, half tone, and sandhi tone) are also properly handled using three different CFs to separate how they affect syllable duration. Experimental results show that the variance of syllable duration is greatly reduced from 180.17 to 2.52 frame² (1 frame = 5 ms) by the syllable duration modeling, which eliminates the effects of those affecting factors. Moreover, the estimated CFs of those affecting factors agree well with our prior linguistic knowledge. Two extensions of the duration modeling method are also performed. One is the use of the same technique to model initial and final durations. The other is to replace the multiplicative model with an additive one. Lastly, a preliminary study of applying the proposed model to predict syllable duration for TTS (text-to-speech) is also performed. Experimental results show that it outperforms the conventional regressive prediction method.
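A multiplicative companding-factor model becomes additive in the log domain, which makes the estimation idea easy to show. The sketch below fits two hypothetical factors (tone and position) by alternating conditional means rather than the paper's full EM algorithm; the factor set and fitting procedure are illustrative assumptions.

```python
import numpy as np

def fit_cfs(log_dur, tone_id, pos_id, n_tone, n_pos, iters=20):
    """log_dur: log syllable durations; tone_id/pos_id: integer factor labels.
    Alternating conditional means in the log domain; returns the base
    duration and multiplicative CFs for each factor level."""
    base = log_dur.mean()
    cf_tone = np.zeros(n_tone)
    cf_pos = np.zeros(n_pos)
    for _ in range(iters):
        r = log_dur - base - cf_pos[pos_id]
        for t in range(n_tone):
            cf_tone[t] = r[tone_id == t].mean()
        r = log_dur - base - cf_tone[tone_id]
        for p in range(n_pos):
            cf_pos[p] = r[pos_id == p].mean()
        base = (log_dur - cf_tone[tone_id] - cf_pos[pos_id]).mean()
    # Back to the duration domain: duration = base * CF_tone * CF_pos.
    return np.exp(base), np.exp(cf_tone), np.exp(cf_pos)
```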
{"title":"A new duration modeling approach for Mandarin speech","authors":"Sin-Horng Chen, Wen-Hsing Lai, Yih-Ru Wang","doi":"10.1109/TSA.2003.814377","DOIUrl":"https://doi.org/10.1109/TSA.2003.814377","url":null,"abstract":"A new duration modeling approach for Mandarin speech is proposed. It explicitly takes several major affecting factors, such as multiplicative companding factors (CFs), and estimates all model parameters by an EM algorithm. The three basic Tone 3 patterns (i.e., full tone, half tone and sandhi tone) are also properly considered using three different CFs to separate how they affect syllable duration. Experimental results show that the variance of the syllable duration is greatly reduced from 180.17 to 2.52 frame/sup 2/ (1 frame = 5 ms) by the syllable duration modeling to eliminate effects from those affecting factors. Moreover, the estimated CFs of those affecting factors agree well with our prior linguistic knowledge. Two extensions of the duration modeling method are also performed. One is the use of the same technique to model initial and final durations. The other is to replace the multiplicative model with an additive one. Lastly, a preliminary study of applying the proposed model to predict syllable duration for TTS (text-to-speech) is also performed. Experimental results show that it outperforms the conventional regressive prediction method.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"74 1","pages":"308-320"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83361525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
High-fidelity multichannel audio coding with Karhunen-Loeve transform
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814375
Dai Yang, H. Ai, C. Kyriakakis, C.-C. Jay Kuo
A new quality-scalable high-fidelity multichannel audio compression algorithm based on MPEG-2 advanced audio coding (AAC) is presented. The Karhunen-Loeve transform (KLT) is applied to multichannel audio signals in the preprocessing stage to remove interchannel redundancy. Then, signals in decorrelated channels are compressed by a modified AAC main profile encoder. Finally, a channel transmission control mechanism is used to re-organize the bitstream so that the multichannel audio bitstream has a quality scalable property when it is transmitted over a heterogeneous network. Experimental results show that, compared with AAC, the proposed algorithm achieves a better performance while maintaining a similar computational complexity at the regular bit rate of 64 kbit/sec/ch. When the bitstream is transmitted to narrowband end users at a lower bit rate, packets in some channels can be dropped, and slightly degraded, yet full-channel, audio can still be reconstructed in a reasonable fashion without any additional computational cost.
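The KLT preprocessing stage is a plain eigendecomposition of the interchannel covariance. A minimal sketch of the forward and inverse transforms (the AAC coding stage and the channel transmission control are omitted):

```python
import numpy as np

def klt_forward(x):
    """x: (channels, samples). Rotate channels into decorrelated
    eigen-channels using the eigenvectors of the interchannel covariance."""
    mean = x.mean(axis=1, keepdims=True)
    xm = x - mean
    C = xm @ xm.T / x.shape[1]      # interchannel covariance
    _, V = np.linalg.eigh(C)        # orthonormal KLT basis
    return V.T @ xm, V, mean        # decorrelated channels + side info

def klt_inverse(y, V, mean):
    """Decoder side: the orthonormal inverse restores the original channels."""
    return V @ y + mean
```

Because V is orthonormal, dropping low-variance eigen-channels (as the bitstream control may do under bandwidth pressure) still lets the decoder reconstruct an approximation of all original channels.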
{"title":"High-fidelity multichannel audio coding with Karhunen-Loeve transform","authors":"Dai Yang, H. Ai, C. Kyriakakis, C.-C. Jay Kuo","doi":"10.1109/TSA.2003.814375","DOIUrl":"https://doi.org/10.1109/TSA.2003.814375","url":null,"abstract":"A new quality-scalable high-fidelity multichannel audio compression algorithm based on MPEG-2 advanced audio coding (AAC) is presented. The Karhunen-Loeve transform (KLT) is applied to multichannel audio signals in the preprocessing stage to remove interchannel redundancy. Then, signals in decorrelated channels are compressed by a modified AAC main profile encoder. Finally, a channel transmission control mechanism is used to re-organize the bitstream so that the multichannel audio bitstream has a quality scalable property when it is transmitted over a heterogeneous network. Experimental results show that, compared with AAC, the proposed algorithm achieves a better performance while maintaining a similar computational complexity at the regular bit rate of 64 kbit/sec/ch. When the bitstream is transmitted to narrowband end users at a lower bit rate, packets in some channels can be dropped, and slightly degraded, yet full-channel, audio can still be reconstructed in a reasonable fashion without any additional computational cost.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"29 1","pages":"365-380"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77844797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Perceptual phase quantization of speech
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814409
Doh-Suk Kim
It is essential to incorporate perceptual characteristics of human hearing in modern speech/audio coding systems. However, the focus has been confined only to the magnitude information of speech, and little attention has been paid to phase information. A quantitative study on the characteristics of human phase perception is presented and a novel method is proposed for the quantization of phase information in speech/audio signals. First, the just-noticeable difference (JND) of phase for each harmonic in flat-spectrum periodic tones is measured for several different fundamental frequencies. Then, a mathematical model of JND is established, based on measured data, to form a weighting function for phase quantization. Since the proposed weighting function is derived from psychoacoustic measurements, it provides a novel quantization method by which more bits are assigned to perceptually important phase components at the sacrifice of less important ones, resulting in a quantized signal perceptually closer to the original one. Experimental results on five vowel speech signals demonstrate that the proposed weighting function is very effective for the quantization of phase information.
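Once a JND curve is available, the quantization idea is direct: tie the step size to the per-harmonic JND so that more bits go to phase components with small JNDs. In the sketch below the JND values are hypothetical placeholders, not the paper's measured model:

```python
import numpy as np

def quantize_phases(phases, jnd):
    """Step size per harmonic equals its modeled JND, keeping quantization
    error below the noticeable level; small-JND harmonics get more levels."""
    step = np.asarray(jnd, dtype=float)
    idx = np.round(np.asarray(phases) / step)
    return idx * step, idx.astype(int)   # reconstructed phases, integer codes

# Hypothetical JND model: sensitivity decreasing with harmonic index.
harmonics = np.arange(1, 11)
jnd = 0.05 * harmonics                   # radians, illustrative only
phases = np.random.default_rng(0).uniform(-np.pi, np.pi, 10)
phases_hat, indices = quantize_phases(phases, jnd)
```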
{"title":"Perceptual phase quantization of speech","authors":"Doh-Suk Kim","doi":"10.1109/TSA.2003.814409","DOIUrl":"https://doi.org/10.1109/TSA.2003.814409","url":null,"abstract":"It is essential to incorporate perceptual characteristics of human hearing in modern speech/audio coding systems. However, the focus has been confined only to the magnitude information of speech, and little attention has been paid to phase information. A quantitative study on the characteristics of human phase perception is presented and a novel method is proposed for the quantization of phase information in speech/audio signals. First, the just-noticeable difference (JND) of phase for each harmonic in flat-spectrum periodic tones is measured for several different fundamental frequencies. Then, a mathematical model of JND is established, based on measured data, to form a weighting function for phase quantization. Since the proposed weighting function is derived from psychoacoustic measurements, it provides a novel quantization method by which more bits are assigned to perceptually important phase components at the sacrifice of less important ones, resulting in a quantized signal perceptually closer to the original one. Experimental results on five vowel speech signals demonstrate that the proposed weighting function is very effective for the quantization of phase information.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"3 1","pages":"355-364"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84935583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
A generalized subspace approach for enhancing speech corrupted by colored noise
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814458
Y. Hu, P. Loizou
A generalized subspace approach is proposed for enhancement of speech corrupted by colored noise. A nonunitary transform, based on the simultaneous diagonalization of the clean speech and noise covariance matrices, is used to project the noisy signal onto a signal-plus-noise subspace and a noise subspace. The clean signal is estimated by nulling the signal components in the noise subspace and retaining the components in the signal subspace. The applied transform has built-in prewhitening and can therefore be used in general for colored noise. The proposed approach is shown to be a generalization of the approach proposed by Y. Ephraim and H.L. Van Trees (see ibid., vol.3, p.251-66, 1995) for white noise. Two estimators are derived based on the nonunitary transform, one based on time-domain constraints and one based on spectral domain constraints. Objective and subjective measures demonstrate improvements over other subspace-based methods when tested with TIMIT sentences corrupted with speech-shaped noise and multi-talker babble.
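The simultaneous diagonalization at the heart of the method is exactly a generalized eigendecomposition. A minimal sketch with the gain rule simplified relative to the paper's time-domain and spectral-domain estimators:

```python
import numpy as np
from scipy.linalg import eigh

def subspace_enhance(Ry, Rn, y, mu=1.0):
    """Ry, Rn: noisy-speech and noise covariance estimates (K x K; Rn must
    be positive definite); y: noisy frame of length K; mu: Lagrange-style
    tradeoff constant (illustrative)."""
    Rs = Ry - Rn                      # clean-speech covariance estimate
    lam, V = eigh(Rs, Rn)             # V.T @ Rs @ V = diag(lam), V.T @ Rn @ V = I
    lam = np.maximum(lam, 0.0)        # null the noise subspace (lam <= 0)
    g = lam / (lam + mu)              # simplified gain on the signal subspace
    H = np.linalg.inv(V.T) @ np.diag(g) @ V.T   # nonunitary estimator
    return H @ y
```

Because V is nonunitary with V.T @ Rn @ V = I, the transform prewhitens the colored noise implicitly, which is why no separate prewhitening stage is needed.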
{"title":"A generalized subspace approach for enhancing speech corrupted by colored noise","authors":"Y. Hu, P. Loizou","doi":"10.1109/TSA.2003.814458","DOIUrl":"https://doi.org/10.1109/TSA.2003.814458","url":null,"abstract":"A generalized subspace approach is proposed for enhancement of speech corrupted by colored noise. A nonunitary transform, based on the simultaneous diagonalization of the clean speech and noise covariance matrices, is used to project the noisy signal onto a signal-plus-noise subspace and a noise subspace. The clean signal is estimated by nulling the signal components in the noise subspace and retaining the components in the signal subspace. The applied transform has built-in prewhitening and can therefore be used in general for colored noise. The proposed approach is shown to be a generalization of the approach proposed by Y. Ephraim and H.L. Van Trees (see ibid., vol.3, p.251-66, 1995) for white noise. Two estimators are derived based on the nonunitary transform, one based on time-domain constraints and one based on spectral domain constraints. Objective and subjective measures demonstrate improvements over other subspace-based methods when tested with TIMIT sentences corrupted with speech-shaped noise and multi-talker babble.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"46 1","pages":"334-341"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86859736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 406
Joint filterbanks for echo cancellation and audio coding
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814798
P. Eneroth
Joint structures for audio coding and echo cancellation are investigated, utilizing standard audio coders. Two types of audio coders are considered, coders based on cosine modulated filterbanks and coders based on the modified discrete cosine transform (MDCT). For the first coder type, two methods for combining such a coder with a subband echo canceller are proposed. The two methods are: a modified audio coder filterbank that is suitable for echo cancellation but still generates the same final decomposition as the standard audio coder filterbank, and another that converts subband signals between an audio coder filterbank and a filterbank designed for echo cancellation. For the MDCT based audio coder, a joint structure with a frequency-domain adaptive filter based echo canceller is considered. Computational complexity and transmission delay for the different coder/echo canceller combinations are presented. Convergence properties of the proposed echo canceller structures are shown using simulations with real-life recorded speech.
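The structure a shared analysis filterbank enables is a bank of independent subband adaptive filters. The sketch below uses an STFT in place of the paper's cosine-modulated or MDCT filterbanks, with a per-band complex NLMS update; filter length and step size are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def subband_nlms(far, mic, fs, taps=8, mu=0.5, eps=1e-8):
    """Cancel the echo of `far` (loudspeaker signal) from `mic` by running
    an independent complex NLMS filter in every STFT subband."""
    _, _, X = stft(far, fs, nperseg=256)   # far-end subband signals
    _, _, D = stft(mic, fs, nperseg=256)   # microphone subband signals
    E = np.zeros_like(D)
    for k in range(X.shape[0]):            # one adaptive filter per band
        w = np.zeros(taps, dtype=complex)
        buf = np.zeros(taps, dtype=complex)
        for n in range(X.shape[1]):
            buf = np.roll(buf, 1)
            buf[0] = X[k, n]
            e = D[k, n] - np.vdot(w, buf)  # echo estimate is w^H buf
            w += mu * np.conj(e) * buf / (np.vdot(buf, buf).real + eps)
            E[k, n] = e                    # residual = echo-cancelled subband
    _, out = istft(E, fs, nperseg=256)
    return out
```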
{"title":"Joint filterbanks for echo cancellation and audio coding","authors":"P. Eneroth","doi":"10.1109/TSA.2003.814798","DOIUrl":"https://doi.org/10.1109/TSA.2003.814798","url":null,"abstract":"Joint structures for audio coding and echo cancellation are investigated, utilizing standard audio coders. Two types of audio coders are considered, coders based on cosine modulated filterbanks and coders based on the modified discrete cosine transform (MDCT). For the first coder type, two methods for combining such a coder with a subband echo canceller are proposed. The two methods are: a modified audio coder filterbank that is suitable for echo cancellation but still generates the same final decomposition as the standard audio coder filterbank, and another that converts subband signals between an audio coder filterbank and a filterbank designed for echo cancellation. For the MDCT based audio coder, a joint structure with a frequency-domain adaptive filter based echo canceller is considered. Computational complexity and transmission delay for the different coder/echo canceller combinations are presented. Convergence properties of the proposed echo canceller structures are shown using simulations with real-life recorded speech.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"45 1","pages":"342-354"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75097897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11