
IEEE Trans. Speech Audio Process.: Latest Articles

Prosodic and accentual information for automatic speech recognition
Pub Date : 2003-07-28 DOI: 10.1109/TSA.2003.814368
Diego H. Milone, A. Rubio
Various aspects relating to the human production and perception of speech have gradually been incorporated into automatic speech recognition systems. Nevertheless, the set of speech prosodic features has not yet been used in an explicit way in the recognition process itself. This study presents an analysis of prosody's three most important parameters, namely energy, fundamental frequency and duration, together with a method for incorporating this information into automatic speech recognition. On the basis of a preliminary analysis, a design is proposed for a prosodic feature classifier in which these parameters are associated with orthographic accentuation. Prosodic-accentual features are incorporated in a hidden Markov model recognizer; their theoretical formulation and experimental setup are then presented. Several experiments were conducted to show how the method performs with a Spanish continuous-speech database. Using this approach to process other database subsets, we obtained a word recognition error reduction rate of 28.91%.
Pages: 321-333
Citations: 28
Performance limits in subband beamforming
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811543
S. Nordholm, I. Claesson, N. Grbic
This paper analyzes subband beamforming schemes mainly aimed at speech enhancement and acoustic echo suppression applications such as hands-free telephony for both mobile and office environments, Internet telephony and video conferencing. Analytical descriptions of both causal finite-length and noncausal infinite-length subband microphone array structures are given. More specifically, this paper compares finite Wiener filter performance with the noncausal Wiener solution, giving a comprehensive theoretical suppression limit. It is shown that even short filters will yield a good approximation of the infinite solution, provided that the element spacing and temporal sampling are matched to the frequency band of interest. Typically, 10 to 20 FIR taps are sufficient in each subband.
Pages: 193-203
Citations: 20
Generalized digital waveguide networks
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811541
D. Rocchesso, J. Smith
Digital waveguides are generalized to the multivariable case with the goal of maximizing generality while retaining robust numerical properties and simplicity of realization. Multivariable complex power is defined, and conditions for "medium passivity" are presented. Multivariable complex wave impedances, such as those deriving from multivariable lossy waveguides, are used to construct scattering junctions that yield frequency-dependent scattering coefficients, which can be implemented in practice using digital filters. The general form for the scattering matrix at a junction of multivariable waveguides is derived. An efficient class of loss-modeling filters is derived, including a rule for checking validity of the small-loss assumption. An example application in musical acoustics is given.
Pages: 242-254
Citations: 17
Finite difference schemes and digital waveguide networks for the wave equation: stability, passivity, and numerical dispersion
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811535
S. Bilbao, J. Smith
In this paper, some simple families of explicit two-step finite difference methods for solving the wave equation in two and three spatial dimensions are examined. These schemes depend on several free parameters, and can be associated with so-called interpolated digital waveguide meshes. Special attention is paid to the stability properties of these schemes (in particular the bounds on the space-step/time-step ratio) and their relationship with the passivity condition on the related digital waveguide networks. Boundary conditions are also discussed. An analysis of the directional numerical dispersion properties of these schemes is provided, and minimally directionally-dispersive interpolated digital waveguide meshes are constructed.
Pages: 255-266
Citations: 24
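The bound on the space-step/time-step ratio mentioned in the entry above can be illustrated with the simplest member of this family of schemes. The sketch below assumes the standard rectilinear (5-point) explicit two-step scheme in two dimensions with boundaries held at zero, for which the Courant number lambda = c*T/X must satisfy lambda <= 1/sqrt(2). The function names, grid size, and step count are illustrative choices rather than anything taken from the paper, and the interpolated schemes it studies carry extra free parameters that are not modelled here.

```python
import math


def courant_limit(dims):
    """Stability limit on lambda = c*T/X for the standard rectilinear scheme (assumed)."""
    return 1.0 / math.sqrt(dims)


def simulate_2d(lam, size=15, steps=40):
    """Run the standard explicit two-step scheme on a size x size grid.

    Boundary values are held at zero. Returns the largest |u| at the final step.
    """
    lam2 = lam * lam
    prev = [[0.0] * size for _ in range(size)]
    prev[size // 2][size // 2] = 1.0       # unit impulse at the centre
    curr = [row[:] for row in prev]        # identical first two time steps
    for _ in range(steps):
        nxt = [[0.0] * size for _ in range(size)]
        for i in range(1, size - 1):
            for j in range(1, size - 1):
                # discrete Laplacian on the 5-point stencil
                lap = (curr[i + 1][j] + curr[i - 1][j]
                       + curr[i][j + 1] + curr[i][j - 1] - 4.0 * curr[i][j])
                nxt[i][j] = 2.0 * curr[i][j] - prev[i][j] + lam2 * lap
        prev, curr = curr, nxt
    return max(abs(v) for row in curr for v in row)
```

Running `simulate_2d(0.5)` keeps the field bounded, while `simulate_2d(1.0)` exceeds the limit returned by `courant_limit(2)` and the highest-frequency grid modes grow without bound.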
Distortion discriminant analysis for audio fingerprinting
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811538
C. Burges, John C. Platt, S. Jana
Mapping audio data to feature vectors for classification, retrieval or identification tasks presents four principal challenges. The dimensionality of the input must be significantly reduced; the resulting features must be robust to likely distortions of the input; the features must be informative for the task at hand; and the feature extraction operation must be computationally efficient. We propose distortion discriminant analysis (DDA), which fulfills all four of these requirements. DDA constructs a linear, convolutional neural network out of layers, each of which performs an oriented PCA dimensional reduction. We demonstrate the effectiveness of DDA on two audio fingerprinting tasks: searching for 500 audio clips in 36 h of audio test data; and playing over 10 days of audio against a database with approximately 240 000 fingerprints. We show that the system is robust to kinds of noise that are not present in the training procedure. In the large test, the system gives a false positive rate of 1.5 × 10^-8 per audio clip, per fingerprint, at a false negative rate of 0.2% per clip.
Pages: 165-174
Citations: 152
SNR estimation based on amplitude modulation analysis with applications to noise suppression
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811542
J. Tchorz, B. Kollmeier
A single-microphone noise suppression algorithm is described that is based on a novel approach for the estimation of the signal-to-noise ratio (SNR) in different frequency channels: The input signal is transformed into neurophysiologically-motivated spectro-temporal input features. These patterns are called amplitude modulation spectrograms (AMS), as they contain information on both center frequencies and modulation frequencies within each 32-ms analysis frame. The different representations of speech and noise in AMS patterns are detected by a neural network, which estimates the present SNR in each frequency channel. Quantitative experiments show a reliable estimation of the SNR for most types of nonspeech background noise. For noise suppression, the frequency bands are attenuated according to the estimated present SNR using a Wiener filter approach. Objective speech quality measures, informal listening tests, and the results of automatic speech recognition experiments indicate a substantial benefit from AMS-based noise suppression, in comparison to unprocessed noisy speech.
Pages: 184-192
Citations: 94
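The attenuation step in the entry above, where each frequency band is scaled according to its estimated SNR, can be sketched as follows. This assumes the textbook Wiener gain G = SNR / (1 + SNR) with SNR expressed as a linear power ratio; the exact gain rule used in the paper may differ. The per-band SNR estimates are treated as given inputs (the neural-network estimator is not modelled), and both function names are hypothetical.

```python
def wiener_gain(snr_db):
    """Textbook Wiener gain for one band, given its estimated SNR in dB."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return snr_linear / (1.0 + snr_linear)


def suppress(band_magnitudes, band_snr_db):
    """Scale each band magnitude by the gain derived from its SNR estimate."""
    if len(band_magnitudes) != len(band_snr_db):
        raise ValueError("one SNR estimate is needed per band")
    return [m * wiener_gain(s) for m, s in zip(band_magnitudes, band_snr_db)]
```

With this rule a band estimated at 0 dB is halved, bands with high SNR pass almost unchanged, and bands dominated by noise are attenuated towards zero.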
Recursive identification of acoustic echo systems using orthonormal basis functions
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811536
Lester S. H. Ngia
In hands-free telephone or video conference applications, there exists an acoustic feedback coupling between the loudspeaker and microphone in an enclosed environment, which creates the acoustic echo. FIR filters are commonly used in acoustic echo cancellers because of their simple structure. However, in this paper, the Kautz and Laguerre filter structures are shown to be more efficient echo cancellers than the FIR filters, because they can describe accurately the acoustic echo system with fewer parameters. These filters are built from their respective orthonormal Kautz and Laguerre basis functions. The proposal is motivated by some theoretical and numerical results that the time-varying acoustic echo path is basically due to its time-varying zeros and not its time-invariant acoustical poles. Therefore, the poles of the Kautz and the Laguerre filters are estimated, and can be kept fixed or updated occasionally if required. The poles are estimated by a batch Gauss-Newton algorithm. Then, the coefficients of the Kautz and Laguerre filters can be estimated by most recursive algorithms that are suitable for linear regression models, e.g., the normalized LMS algorithm. Generally, it is shown that the proposed Kautz and Laguerre filters, as the filter structures in an acoustic echo canceller, have better convergence and tracking properties than the FIR and IIR filters.
Pages: 278-293
Citations: 12
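The entry above notes that, once the poles are fixed, the filter coefficients can be estimated by recursive algorithms for linear regression models such as normalized LMS. A minimal sketch of that update is given below. For simplicity the regressor is a plain tapped delay line (the FIR case); in a Laguerre or Kautz canceller the regressor would instead hold the outputs of the basis filters, which are not implemented here. The step size, the regularization constant, and the 3-tap "echo path" are made-up illustration values.

```python
import random


def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Estimate weights w so that w . [x[n], x[n-1], ...] tracks d[n]."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps                  # regressor, most recent sample first
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_n = sum(wi * bi for wi, bi in zip(w, buf))
        e_n = d_n - y_n                     # a-priori estimation error
        norm = eps + sum(b * b for b in buf)
        # normalized update: step scaled by the regressor energy
        w = [wi + mu * e_n * bi / norm for wi, bi in zip(w, buf)]
    return w


def demo():
    """Identify a made-up 3-tap echo path from noise-free data."""
    rng = random.Random(0)
    true_path = [0.6, -0.3, 0.1]            # hypothetical echo path
    x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    d = []
    for n in range(len(x)):
        d.append(sum(h * x[n - k] for k, h in enumerate(true_path) if n - k >= 0))
    return true_path, nlms_identify(x, d, num_taps=3)
```

Calling `demo()` returns the true and estimated coefficients side by side; with noise-free data the two agree closely after a few hundred samples.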
A multipitch tracking algorithm for noisy speech
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811539
Mingyang Wu, Deliang Wang, Guy J. Brown
An effective multipitch tracking algorithm for noisy speech is critical for acoustic signal processing. However, the performance of existing algorithms is not satisfactory. We present a robust algorithm for multipitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new method for extracting periodicity information across different channels, and a hidden Markov model (HMM) for forming continuous pitch tracks. The resulting algorithm can reliably track single and double pitch tracks in a noisy environment. We suggest a pitch error measure for the multipitch situation. The proposed algorithm is evaluated on a database of speech utterances mixed with various types of interference. Quantitative comparisons show that our algorithm significantly outperforms existing ones.
Pages: 229-241
Citations: 304
An enhanced dynamic time warping model for improved estimation of DTW parameters
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811540
R. Yaniv, D. Burshtein
We introduce an enhanced dynamic time warping model (EDTW) which, unlike conventional dynamic time warping (DTW), considers all possible alignment paths for recognition as well as for parameter estimation. The model, for which DTW and the hidden Markov model (HMM) are special cases, is based on a well-defined quality measure. We extend the derivation of the Forward and Viterbi algorithms for HMMs, in order to obtain efficient solutions for the problems of recognition and optimal path alignment in the new proposed model. We then extend the Baum-Welch (1972) estimation algorithm for HMMs and obtain an iterative method for estimating the model parameters of the new model based on the Baum inequality. This estimation method efficiently considers all possible alignment paths between the training data and the current model. A standard segmental K-means estimation algorithm is also derived for EDTW. We compare the performance of the two training algorithms, with various path movement constraints, in two isolated letter recognition tasks. The new estimation algorithm was found to improve performance over segmental K-means in most experiments.
Pages: 216-228
Citations: 37
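For contrast with the EDTW model in the entry above, conventional DTW keeps only the single best alignment path. The sketch below shows that baseline, assuming an absolute-difference local cost and the three standard moves (advance one sequence, the other, or both). It does not implement EDTW itself, which considers all alignment paths and adds parameter estimation; the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Cost of the single best alignment path between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b[j-1] is reused
                                     cost[i][j - 1],      # b advances, a[i-1] is reused
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated 2 is absorbed by the warping, whereas a frame-by-frame comparison of the two sequences would not even be defined.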
Natural language spoken interface control using data-driven semantic inference
Pub Date : 2003-07-09 DOI: 10.1109/TSA.2003.811534
J. Bellegarda, Kim E. A. Silverman
Spoken interaction tasks are typically approached using a formal grammar as language model. While ensuring good system performance, this imposes a rigid framework on users, by implicitly forcing them to conform to a pre-defined interaction structure. This paper introduces the concept of data-driven semantic inference, which in principle allows for any word constructs in command/query formulation. Each unconstrained word string is automatically mapped onto the intended action through a semantic classification against the set of supported actions. As a result, it is no longer necessary for users to memorize the exact syntax of every command. The underlying (latent semantic analysis) framework relies on co-occurrences between words and commands, as observed in a training corpus. A suitable extension can also handle commands that are ambiguous at the word level. The behavior of semantic inference is characterized using a desktop user interface control task involving 113 different actions. Under realistic usage conditions, this approach exhibits a 2 to 5% classification error rate. Various training scenarios of increasing scope are considered to assess the influence of coverage on performance. Sufficient semantic knowledge about the task domain is found to be captured at a level of coverage as low as 70%. This illustrates the good generalization properties of semantic inference.
Pages: 267-277
Citations: 17