
IEEE Trans. Speech Audio Process.: Latest Articles

A study on model-based error rate estimation for automatic speech recognition
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818030
C. Huang, Hsiao-Chuan Wang, Chin-Hui Lee
A model-based framework for classification error rate estimation is proposed for speech and speaker recognition. It aims at predicting the run-time performance of a hidden Markov model (HMM) based recognition system for a given task vocabulary and grammar without the need to run recognition experiments on a separate set of testing samples. This is highly desirable both in theory and in practice. However, the error rate expression in HMM-based speech recognition systems has no closed-form solution, due to the complexity of the multi-class comparison process and the need for dynamic time warping to handle various speech patterns. To alleviate the difficulty, we propose a one-dimensional, model-based misclassification measure that evaluates the distance between a particular model of interest and a combination of many of its competing models. The error rate for a class characterized by an HMM is then the value of a smoothed zero-one error function given the misclassification measure. The overall error rate of the task vocabulary can then be computed as a function of all the available class error rates. The key is to evaluate the misclassification measure in terms of the parameters of environment-matched models without running recognition experiments, where the models are adapted with very limited data that could be just the testing utterance itself. In this paper, we show how the misclassification measure can be approximated by first computing the distance between two mixture Gaussian densities, then between two HMMs with mixture Gaussian state observation densities, and finally between two sequences of HMMs. The misclassification measure is then converted into a classification error rate. When the error rates obtained in actual experiments are compared with those of the new framework, the proposed algorithm accurately estimates the classification error rate for many types of speech and speaker recognition problems. Based on the same framework, it is also demonstrated that the error rate of a recognition system in a noisy environment can be predicted.
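The chain described in the abstract (model distance, then a misclassification measure, then a smoothed zero-one error) can be sketched for the simplest case of two single-Gaussian classes. The Bhattacharyya distance and the sigmoid slope `alpha` below are illustrative stand-ins, not the paper's exact measure:

```python
import numpy as np

def bhattacharyya_1d(m1, v1, m2, v2):
    """Closed-form Bhattacharyya distance between two 1-D Gaussians."""
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2))))

def smoothed_zero_one(d, alpha=4.0):
    """Smoothed zero-one error: a sigmoid of the model-distance measure.

    d = 0 (indistinguishable models) gives chance-level error 0.5; large d
    drives the error toward 0. alpha is a hypothetical smoothing slope.
    """
    return 1.0 / (1.0 + np.exp(alpha * d))

# Identical models: chance-level error; well-separated models: low error.
e_same = smoothed_zero_one(bhattacharyya_1d(0.0, 1.0, 0.0, 1.0))
e_far = smoothed_zero_one(bhattacharyya_1d(0.0, 1.0, 3.0, 1.0))
```

The same idea extends to mixture densities and HMM sequences, as the paper describes, by approximating the distance at each level.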
Pages: 581-589
Citations: 17
Coloration perception depending on sound direction
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818032
Y. Seki, Kiyohide Ito
Coloration is a phenomenon in which timbre changes when reflected and direct sounds are mixed. We studied the relationship between the perception of coloration and the directions of the two sounds. Psychological experiments with 11 subjects suggested that the 50% coloration threshold does not differ with direction. When the level ratio of the two sounds is closer to 0 dB, a difference appears: if the direct sound comes from a lateral direction and the reflected sound from the opposite direction, coloration perception does not increase monotonically even as the ratio approaches 0 dB. We assumed that this directional difference results from the directional dependence of the spectrum, including the head-related transfer function (HRTF), and propose a numerical model for predicting the psychological results using the comb structure of the spectrum observed at the eardrum. We measured spectra with a head and torso simulator (HATS) and calculated the comb-structure area, eventually finding a quantitative relationship between this area and the psychological results and proposing a prediction model based on this relationship.
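The comb structure at the heart of this study arises whenever a direct sound is mixed with one delayed, attenuated copy. A minimal free-field sketch (it omits the HRTF that the paper includes) shows how the peak-to-notch ripple depth grows as the level ratio approaches 0 dB; the delay `tau` and grid sizes are illustrative:

```python
import numpy as np

def comb_ripple_db(ratio_db, tau=0.001, fs=16000, n=4096):
    """Peak-to-notch depth (dB) of the comb produced by mixing a direct
    sound with one reflection delayed by tau seconds and attenuated by
    ratio_db relative to the direct sound."""
    a = 10.0 ** (ratio_db / 20.0)                      # reflection amplitude
    f = np.fft.rfftfreq(n, 1.0 / fs)                   # frequency grid
    mag = np.abs(1.0 + a * np.exp(-2j * np.pi * f * tau))
    return 20.0 * np.log10(mag.max() / mag.min())
```

A reflection 3 dB below the direct sound produces a ripple of roughly 15 dB, versus under 2 dB for a reflection 20 dB down, which is why coloration is most audible near a 0 dB level ratio.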
Pages: 817-825
Citations: 8
Interpolated rectangular 3-D digital waveguide mesh algorithms with frequency warping
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818028
L. Savioja, V. Välimäki
Various interpolated three-dimensional (3-D) digital waveguide mesh algorithms are elaborated. We introduce an optimized technique that improves a previously proposed trilinearly interpolated 3-D mesh and renders the mesh more homogeneous in different directions. Furthermore, various sparse versions of the interpolated mesh algorithm are investigated, which reduce the computational complexity at the expense of accuracy. Frequency-warping techniques are used to shift the frequencies of the mesh output signal in order to cancel the effect of dispersion error. These extensions improve the accuracy of 3-D digital waveguide mesh simulations enough that, in the future, the method can be used for acoustical simulations needed, for example, in the design of listening rooms.
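The non-interpolated rectangular 3-D mesh that these algorithms build on updates each junction from its six axial neighbours. A minimal sketch of that standard update (not the trilinearly interpolated scheme itself), using periodic boundaries for brevity:

```python
import numpy as np

def mesh_step(p_now, p_prev):
    """One time step of a rectangular 3-D digital waveguide mesh:
    p(t+1) = (1/3) * (sum of the 6 axial neighbours) - p(t-1).
    np.roll gives periodic boundaries in this toy sketch."""
    neigh = sum(np.roll(p_now, s, axis) for axis in range(3) for s in (1, -1))
    return neigh / 3.0 - p_prev

# A spatially constant pressure field is a trivial solution of the update.
p = np.full((8, 8, 8), 0.7)
p_next = mesh_step(p, p)
```

The interpolated variants replace the six-neighbour stencil with weighted contributions from all 26 surrounding junctions, which is what makes the dispersion more uniform across directions.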
Pages: 783-790
Citations: 65
Incorporating the human hearing properties in the signal subspace approach for speech enhancement
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818031
F. Jabloun, B. Champagne
The major drawback of most noise reduction methods in speech applications is the annoying residual noise known as musical noise. A potential solution to this artifact is the incorporation of a human hearing model in the suppression filter design. However, since the available models are usually developed in the frequency domain, it is not clear how they can be applied in the signal subspace approach for speech enhancement. In this paper, we present a Frequency to Eigendomain Transformation (FET) which permits the calculation of a perceptually based eigenfilter. This filter yields an improved result in which, from a perceptual perspective, better shaping of the residual noise is achieved. The proposed method can also be used in the general case of colored noise. Spectrogram illustrations and listening test results are given to show the superiority of the proposed method over the conventional signal subspace approach.
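The conventional (non-perceptual) signal subspace filter that the FET extends can be sketched as an eigendomain Wiener gain: eigendecompose the sample covariance, subtract the noise power from each eigenvalue, and reweight. This is a generic baseline sketch, not the paper's perceptually weighted design:

```python
import numpy as np

def subspace_enhance(frames, noise_var):
    """Plain signal-subspace filter. frames: (num_frames, frame_len) array
    of (possibly overlapping) signal frames; noise_var: white-noise power."""
    R = frames.T @ frames / frames.shape[0]       # sample covariance
    w, V = np.linalg.eigh(R)                      # eigendomain
    clean = np.maximum(w - noise_var, 0.0)        # subtract noise power
    gains = clean / np.maximum(w, 1e-12)          # Wiener-style gains
    H = V @ np.diag(gains) @ V.T                  # the eigenfilter
    return frames @ H

# On noise-only input the eigenvalues hover near noise_var, so the gains
# are small and the output energy drops sharply.
rng = np.random.default_rng(0)
noise_only = rng.normal(0.0, 1.0, (200, 16))
out = subspace_enhance(noise_only, noise_var=1.0)
```

The paper's contribution is to replace these flat Wiener gains with perceptually shaped ones obtained by transforming a frequency-domain masking model into the eigendomain.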
Pages: 700-708
Citations: 189
Automatic phonetic segmentation
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.813579
D. Toledano, L. A. H. Gómez, Luis Villarrubia Grande
This paper presents the results and conclusions of a thorough study on automatic phonetic segmentation. It starts with a review of the state of the art in this field. Then, it analyzes the most frequently used approach, based on a modified Hidden Markov Model (HMM) phonetic recognizer. For this approach, a statistical correction procedure is proposed to compensate for the systematic errors produced by context-dependent HMMs, and the use of speaker adaptation techniques is considered to increase segmentation precision. Finally, this paper explores the possibility of locally refining the boundaries obtained with the former techniques. A general framework is proposed for the local refinement of boundaries, and the performance of several pattern classification approaches (fuzzy logic, neural networks and Gaussian mixture models) is compared within this framework. The resulting phonetic segmentation scheme was able to increase the performance of a baseline HMM segmentation tool from 27.12%, 79.27%, and 97.75% of automatic boundary marks with errors smaller than 5, 20, and 50 ms, respectively, to 65.86%, 96.01%, and 99.31% in speaker-dependent mode, which is a reasonably good approximation to manual segmentation.
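The statistical correction step can be sketched as learning, for each phonetic-context class, the mean offset between automatic and manual boundaries on labelled data, then adding that offset back at test time. This is a simplified illustration of the idea, not the paper's exact procedure:

```python
import numpy as np

def learn_offsets(auto_ms, manual_ms, contexts):
    """Mean systematic boundary error (ms) per phonetic-context class."""
    offsets = {}
    for c in set(contexts):
        errs = [m - a for a, m, ci in zip(auto_ms, manual_ms, contexts) if ci == c]
        offsets[c] = float(np.mean(errs))
    return offsets

def correct(auto_ms, contexts, offsets):
    """Shift each automatic boundary by its context's learned offset."""
    return [a + offsets.get(c, 0.0) for a, c in zip(auto_ms, contexts)]

# Toy data: the HMM is systematically 15 ms late on (hypothetical)
# vowel-fricative boundaries and exact on vowel-vowel ones.
auto = [100.0, 215.0, 340.0, 455.0]
manual = [100.0, 200.0, 340.0, 440.0]
ctx = ["v-v", "v-f", "v-v", "v-f"]
fixed = correct(auto, ctx, learn_offsets(auto, manual, ctx))
```

Because context-dependent HMMs make context-dependent placement errors, even this simple per-context mean removal recovers a large part of the boundary precision.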
Pages: 617-625
Citations: 192
Pitch adaptive windows for improved excitation coding in low-rate CELP coders
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.815530
A. Rao, S. Ahmadi, J. Linden, A. Gersho, V. Cuperman, R. Heidari
A novel paradigm based on pitch-adaptive windows is proposed for solving the problem of encoding the fixed codebook (FCB) excitation in low bit-rate CELP coders. In this method, the nonzero excitation in the fixed codebook is substantially localized to a set of time intervals called windows. The positions of the windows are adaptive to the pitch peaks in the linear prediction residual signal. Thus, high coding efficiency is achieved by allocating most of the available FCB bits to the perceptually important segments of the excitation signal. The pitch-adaptive method is adopted in the design of a novel multimode variable-rate speech coder applicable to CDMA-based cellular telephony. Results demonstrate that the adaptive windows method yields excellent voice quality and intelligibility at average bit-rates in the range of 2.5-4.0 kbps.
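The window placement can be sketched as follows: find the strongest residual peak within each pitch cycle and open a short window around it, so that FCB bits are concentrated where the excitation matters. The window half-width and the per-cycle peak picking are illustrative choices, not the coder's exact rules:

```python
import numpy as np

def pitch_adaptive_windows(residual, pitch_period, half_width):
    """Boolean mask opening a +/- half_width window around the largest
    |residual| peak in each pitch cycle."""
    mask = np.zeros(len(residual), dtype=bool)
    for start in range(0, len(residual), pitch_period):
        seg = np.abs(residual[start:start + pitch_period])
        if seg.size:
            peak = start + int(np.argmax(seg))
            mask[max(0, peak - half_width):peak + half_width + 1] = True
    return mask

# Synthetic LP residual: one pitch pulse per 40-sample cycle.
residual = np.zeros(120)
residual[[10, 50, 90]] = 1.0
mask = pitch_adaptive_windows(residual, pitch_period=40, half_width=2)
```

Here only 15 of 120 samples fall inside windows, which is the source of the coding-efficiency gain: the FCB search space shrinks to the perceptually important neighbourhoods of the pitch pulses.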
Pages: 648-659
Citations: 10
Near-field broadband beamformer design via multidimensional semi-infinite-linear programming techniques
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.815527
K. Yiu, Xiaoqi Yang, S. Nordholm, K. Teo
Broadband microphone arrays have important applications such as hands-free mobile telephony, voice interfaces to personal computers, and video conference equipment. This problem can be tackled in different ways. In this paper, a general broadband beamformer design problem is considered. The problem is posed as a Chebyshev minimax problem. Using the l1-norm measure or the real rotation theorem, we show that it can be converted into a semi-infinite linear programming problem. A numerical scheme using a set of adaptive grids is applied. The scheme is proven to be convergent when a certain grid refinement is used. The method can be applied to the design of multidimensional digital finite-impulse response (FIR) filters with arbitrarily specified amplitude and phase.
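On a finite grid, the Chebyshev minimax problem becomes an ordinary LP: minimize delta subject to -delta <= A w - d <= delta at every grid point. A one-dimensional, zero-phase FIR sketch of that discretization follows (the paper's near-field, complex-response case adds the real-rotation step and adaptive grid refinement); it assumes SciPy is available:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_fit(freqs, desired, taps):
    """Minimize delta s.t. |A @ w - desired| <= delta at each grid frequency,
    posed as an LP over the variables [w_0..w_{taps-1}, delta]."""
    A = np.cos(np.outer(freqs, np.arange(taps)))   # zero-phase cosine basis
    ones = np.ones((len(freqs), 1))
    c = np.zeros(taps + 1)
    c[-1] = 1.0                                    # objective: delta
    A_ub = np.vstack([np.hstack([A, -ones]),       #   A w - d <= delta
                      np.hstack([-A, -ones])])     # -(A w - d) <= delta
    b_ub = np.concatenate([desired, -desired])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * taps + [(0.0, None)])
    return res.x[:taps], res.x[-1]

# A response that the 3-tap basis can represent exactly: delta should be ~0.
grid = np.linspace(0.0, np.pi, 64)
w, delta = minimax_fit(grid, 0.5 + 0.5 * np.cos(grid), taps=3)
```

The semi-infinite formulation in the paper is this same LP with the grid refined adaptively until the continuum constraints are met to within tolerance.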
Pages: 725-732
Citations: 52
Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818076
L. Deng, J. Droppo, A. Acero
We describe a novel algorithm for recursive estimation of nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is the innovative iterative stochastic approximation technique that improves piecewise linear approximation to the nonlinearity involved and that subsequently increases the accuracy for noise estimation. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional, MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reduction achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE is 27.9% for the multicondition training mode, and 67.4% for the clean-only training mode, respectively, compared with the results using the standard cepstra with no speech enhancement and using the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September-2001 AURORA2 evaluation. The relative error rate reduction achieved by using the same noise estimate is increased to 48.40% and 76.86%, respectively, for the two training modes after using a better designed HMM system. The experimental results demonstrated the crucial importance of using the newly introduced iterations in improving the earlier stochastic approximation technique, and showed sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
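The role of the forgetting factor mentioned at the end of the abstract can be seen in a scalar caricature of the recursion: each frame nudges the noise estimate toward the new observation, with the forgetting factor trading tracking speed against estimate variance. This deliberately omits the cepstral-domain nonlinearity and the within-frame iterations of the actual algorithm:

```python
def track_noise(frame_powers, forget=0.95):
    """First-order recursive (stochastic-approximation-style) noise tracker:
    n_t = forget * n_{t-1} + (1 - forget) * y_t."""
    est = frame_powers[0]
    track = []
    for y in frame_powers:
        est = forget * est + (1.0 - forget) * y
        track.append(est)
    return track

# Nonstationary noise: power steps from 1.0 to 4.0 at frame 10.
powers = [1.0] * 10 + [4.0] * 190
slow = track_noise(powers, forget=0.95)
fast = track_noise(powers, forget=0.5)
```

A smaller forgetting factor locks onto the new noise level within a few frames but follows every fluctuation; a larger one is smoother but lags, which is exactly the sensitivity the experiments report.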
Pages: 568-580
Citations: 121
Binaural cue coding-Part II: Schemes and applications
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818108
C. Faller, F. Baumgarte
Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and side information. The companion paper (Part I) covers the psychoacoustic fundamentals of this method and outlines principles for the design of BCC schemes. The BCC analysis and synthesis methods of Part I are motivated and presented in the framework of stereophonic audio coding. This paper, Part II, generalizes the basic BCC schemes presented in Part I. It includes BCC for multichannel signals and employs an enhanced set of perceptual spatial cues for BCC synthesis. A scheme for multichannel audio coding is presented. Moreover, a modified scheme is derived that allows flexible rendering of the spatial image at the receiver supporting dynamic control. All aspects of complete BCC encoder and decoder implementations are discussed, such as down-mixing of the input signals, low complexity estimation of the spatial cues, and quantization and coding of the side information. Application examples are given and the performance of the coder implementations are evaluated and discussed based on subjective listening test results.
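The analysis side of BCC can be sketched as a mono downmix plus one spatial cue, the inter-channel level difference (ICLD), estimated per frequency band. The time-difference and coherence cues and the actual filterbank are omitted; the uniform FFT-bin banding below is an illustrative simplification:

```python
import numpy as np

def bcc_analyze(left, right, n_bands=8):
    """Mono downmix plus per-band inter-channel level differences (dB)."""
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    edges = np.linspace(0, len(L), n_bands + 1, dtype=int)
    icld = np.array([
        10.0 * np.log10((np.sum(np.abs(L[lo:hi]) ** 2) + 1e-12)
                        / (np.sum(np.abs(R[lo:hi]) ** 2) + 1e-12))
        for lo, hi in zip(edges[:-1], edges[1:])])
    return 0.5 * (left + right), icld

# A right channel that is a 6 dB attenuated copy of the left: every band's
# ICLD should read ~6 dB.
rng = np.random.default_rng(0)
left = rng.normal(size=1024)
right = 0.5 * left
downmix, icld = bcc_analyze(left, right)
```

The decoder side inverts this: it rescales the downmix per band so that the synthesized channels reproduce the transmitted ICLDs, which is what makes the side-information rate so small.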
Pages: 520-531
Citations: 237
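The downmix-plus-side-information idea in the abstract above can be sketched in a few lines: sum a stereo pair into one transmitted channel and keep per-band inter-channel level differences (ICLD) as side information. This is an illustrative sketch only, not the coder from the paper; the frame length, Hanning window, and one-cue-per-FFT-bin granularity are simplifying assumptions (real BCC schemes group bins into critical-band-like partitions and also estimate time and coherence cues).

```python
import numpy as np

def bcc_analysis(left, right, n_fft=1024, hop=512, eps=1e-12):
    """Illustrative BCC-style analysis: mono downmix plus per-bin
    inter-channel level differences (ICLD) in dB as side information.
    A hypothetical sketch, not the scheme described in the paper."""
    win = np.hanning(n_fft)
    downmix, cues = [], []
    for start in range(0, len(left) - n_fft + 1, hop):
        l = left[start:start + n_fft] * win
        r = right[start:start + n_fft] * win
        L, R = np.fft.rfft(l), np.fft.rfft(r)
        # One ICLD value per spectral bin; eps guards against log of zero
        # in bins with negligible energy.
        icld_db = 10.0 * np.log10((np.abs(L) ** 2 + eps) /
                                  (np.abs(R) ** 2 + eps))
        downmix.append(0.5 * (l + r))  # sum signal, transmitted as one channel
        cues.append(icld_db)           # side information for this frame
    return downmix, cues
```

For a right channel that is simply the left attenuated by 6 dB, the estimated ICLD at the dominant bin comes out at about 20·log10(2) ≈ 6.02 dB, as expected.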
Particle filtering algorithms for tracking an acoustic source in a reverberant environment 在混响环境中跟踪声源的粒子滤波算法
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818112
D. Ward, E. Lehmann, R. C. Williamson
Traditional acoustic source localization algorithms attempt to find the current location of the acoustic source using data collected at an array of sensors at the current time only. In the presence of strong multipath, these traditional algorithms often erroneously locate a multipath reflection rather than the true source location. A recently proposed approach that appears promising in overcoming this drawback is a state-space approach using particle filtering. In this paper, we formulate a general framework for tracking an acoustic source using particle filters. We discuss four specific algorithms that fit within this framework, and demonstrate their performance using both simulated reverberant data and data recorded in a moderately reverberant office room (with a measured reverberation time of 0.39 s). The results indicate that the proposed family of algorithms is able to accurately track a moving source in a moderately reverberant room.
{"title":"Particle filtering algorithms for tracking an acoustic source in a reverberant environment","authors":"D. Ward, E. Lehmann, R. C. Williamson","doi":"10.1109/TSA.2003.818112","DOIUrl":"https://doi.org/10.1109/TSA.2003.818112","url":null,"abstract":"Traditional acoustic source localization algorithms attempt to find the current location of the acoustic source using data collected at an array of sensors at the current time only. In the presence of strong multipath, these traditional algorithms often erroneously locate a multipath reflection rather than the true source location. A recently proposed approach that appears promising in overcoming this drawback of traditional algorithms, is a state-space approach using particle filtering. In this paper we formulate a general framework for tracking an acoustic source using particle filters. We discuss four specific algorithms that fit within this framework, and demonstrate their performance using both simulated reverberant data and data recorded in a moderately reverberant office room (with a measured reverberation time of 0.39 s). The results indicate that the proposed family of algorithms are able to accurately track a moving source in a moderately reverberant room.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"83 1","pages":"826-836"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77281096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 358
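The state-space tracking idea from the abstract above can be illustrated with a generic bootstrap particle filter. The random-walk dynamics and the Gaussian position likelihood below are hypothetical stand-ins for the acoustic (e.g. TDOA- or steered-beamformer-based) likelihoods evaluated in the paper; this is a sketch of the predict/weight/resample cycle, not any of the paper's four algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_track(measurements, n_particles=500,
                          step_std=0.05, meas_std=0.2):
    """Generic bootstrap particle filter for a 2-D source position.
    `measurements` is a sequence of noisy (x, y) observations; the
    dynamics and likelihood are illustrative assumptions."""
    particles = rng.uniform(0.0, 1.0, size=(n_particles, 2))
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for z in measurements:
        # 1. Predict: propagate particles through random-walk dynamics.
        particles += rng.normal(0.0, step_std, particles.shape)
        # 2. Update: reweight by the measurement likelihood.
        d2 = np.sum((particles - z) ** 2, axis=1)
        weights *= np.exp(-0.5 * d2 / meas_std ** 2)
        weights /= weights.sum()
        # 3. Estimate: posterior mean of the particle cloud.
        estimates.append(weights @ particles)
        # 4. Resample (systematic) to combat weight degeneracy.
        positions = (np.arange(n_particles) + rng.random()) / n_particles
        idx = np.searchsorted(np.cumsum(weights), positions)
        idx = np.minimum(idx, n_particles - 1)  # guard float round-off
        particles = particles[idx]
        weights = np.full(n_particles, 1.0 / n_particles)
    return np.array(estimates)
```

Fed a noisy straight-line trajectory, the posterior-mean estimate locks onto the source after a few frames; in the real algorithms, step 2 would instead evaluate an acoustic likelihood computed from microphone-array data.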