
Latest Publications in IEEE Trans. Speech Audio Process.

Multigrained modeling with pattern specific maximum likelihood transformations for text-independent speaker recognition
Pub Date : 2003-02-19 DOI: 10.1109/TSA.2003.809121
U. Chaudhari, Jirí Navrátil, Stephane H Maes
We present a transformation-based, multigrained data modeling technique in the context of text independent speaker recognition, aimed at mitigating difficulties caused by sparse training and test data. Both identification and verification are addressed, where we view the entire population as divided into the target population and its complement, which we refer to as the background population. First, we present our development of maximum likelihood transformation based recognition with diagonally constrained Gaussian mixture models and show its robustness to data scarcity with results on identification. Then for each target and background speaker, a multigrained model is constructed using the transformation based extension as a building block. The training data is labeled with an HMM based phone labeler. We then make use of a graduated phone class structure to train the speaker model at various levels of detail. This structure is a tree with the root node containing all the phones. Subsequent levels partition the phones into increasingly finer grained linguistic classes. This method affords the use of fine detail where possible, i.e., as reflected in the amount of training data distributed to each tree node. We demonstrate the effectiveness of the modeling with verification experiments in matched and mismatched conditions.
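As a concrete point of reference, the sketch below shows the diagonal-covariance GMM target/background log-likelihood-ratio scoring that systems of this kind build on, using scikit-learn. The maximum likelihood transformations, HMM phone labeling, and multigrained phone-class tree of the paper are not reproduced; the feature dimensions, mixture sizes, and data are illustrative assumptions.

```python
# Minimal sketch: diagonal-covariance GMM target vs. background scoring for
# speaker verification. The ML transformations, phone labeling, and
# multigrained tree of the paper are not reproduced here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Illustrative stand-ins for cepstral feature matrices (frames x dimensions).
target_train = rng.normal(loc=0.5, scale=1.0, size=(2000, 13))
background_train = rng.normal(loc=0.0, scale=1.2, size=(5000, 13))
test_utterance = rng.normal(loc=0.5, scale=1.0, size=(300, 13))

# Diagonally constrained Gaussian mixture models for target and background.
target_gmm = GaussianMixture(n_components=16, covariance_type="diag",
                             random_state=0).fit(target_train)
background_gmm = GaussianMixture(n_components=16, covariance_type="diag",
                                 random_state=0).fit(background_train)

# Frame-averaged log-likelihood ratio: accept if the target model explains
# the utterance better than the background model by some threshold.
llr = target_gmm.score(test_utterance) - background_gmm.score(test_utterance)
print(f"average log-likelihood ratio: {llr:.3f}")
```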
Citations: 33
Multichannel affine and fast affine projection algorithms for active noise control and acoustic equalization systems
Pub Date : 2003-02-19 DOI: 10.1109/TSA.2002.805642
M. Bouchard
In the field of adaptive signal processing, it is well known that affine projection algorithms, or their low-complexity implementations, fast affine projection algorithms, can produce a good tradeoff between convergence speed and computational complexity. Although these algorithms typically do not provide the same convergence speed as recursive-least-squares algorithms, they can provide a much improved convergence speed compared to stochastic gradient descent algorithms, without the large increase in computational load or the instability often found in recursive-least-squares algorithms. In this paper, multichannel affine and fast affine projection algorithms are introduced for active noise control or acoustic equalization. Multichannel fast affine projection algorithms have been previously published for acoustic echo cancellation, but the problem of active noise control or acoustic equalization is a very different one, leading to different structures, as explained in the paper. The computational complexity of the new algorithms is evaluated, and it is shown through simulations that not only can the new algorithms provide the expected tradeoff between convergence performance and computational complexity, they can also provide the best convergence performance (even over recursive-least-squares algorithms) when nonideal noisy acoustic plant models are used in the adaptive systems.
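To make the convergence/complexity tradeoff concrete, the sketch below implements a plain single-channel affine projection update, not the multichannel or fast variants introduced in the paper; the filter length, projection order, and step size are illustrative assumptions.

```python
# Minimal sketch of a single-channel affine projection algorithm (APA) update.
import numpy as np

def apa_filter(x, d, order=32, proj=4, mu=0.5, eps=1e-6):
    """Adapt an FIR filter w so that w @ [x[n], ..., x[n-order+1]] tracks d[n].

    x: input signal, d: desired signal, order: filter length,
    proj: projection order (number of most recent input vectors reused).
    """
    w = np.zeros(order)
    y = np.zeros(len(x))
    for n in range(order + proj - 1, len(x)):
        # Matrix of the 'proj' most recent input vectors (order x proj).
        X = np.column_stack([x[n - k - order + 1:n - k + 1][::-1]
                             for k in range(proj)])
        e = d[n - proj + 1:n + 1][::-1] - X.T @ w            # error vector
        w += mu * X @ np.linalg.solve(X.T @ X + eps * np.eye(proj), e)
        y[n] = w @ x[n - order + 1:n + 1][::-1]
    return w, y

# Toy usage: identify an unknown FIR system from noisy observations.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
h = rng.normal(size=32)
d = np.convolve(x, h)[:len(x)] + 1e-3 * rng.normal(size=len(x))
w, _ = apa_filter(x, d)
print("coefficient error:", np.linalg.norm(w - h))
```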
Citations: 127
Bounded support Gaussian mixture modeling of speech spectra
Pub Date : 2003-02-19 DOI: 10.1109/TSA.2002.805639
J. Lindblom, J. Samuelsson
Lately, Gaussian mixture (GM) models have found new applications in speech processing, and particularly in speech coding. This paper provides a review of GM based quantization and prediction. The main contribution is a discussion on GM model optimization. Two previously presented algorithms of EM-type are analyzed in some detail, and models are estimated and evaluated experimentally using theoretical measures as well as GM based speech spectrum coding and prediction. It has been argued that since many sources have a bounded support, this should be utilized in both the choice of model, and the optimization algorithm. By low-dimensional modeling examples, illustrating the behavior of the two algorithms graphically, and by full-scale evaluation of GM based systems, the advantages of a bounded support approach are quantified. For all evaluation techniques in the study, model accuracy is improved when the bounded support approach is adopted. The gains are typically largest for models with diagonal covariance matrices.
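A minimal sketch of the bounded-support idea follows, assuming per-dimension truncation of a diagonal-covariance GMM to a support box [a, b]: each component's density is renormalized by its probability mass inside the box. The paper's EM-type estimation algorithms and speech-spectrum experiments are not reproduced, and the mixture parameters are illustrative.

```python
# Minimal sketch: evaluating a bounded-support (per-dimension truncated)
# diagonal-covariance Gaussian mixture density.
import numpy as np
from scipy.stats import norm

def bounded_gmm_logpdf(x, weights, means, stds, a, b):
    """Log-density of a truncated diagonal GMM at points x (n x d).

    weights: (K,), means/stds: (K, d), a/b: (d,) lower/upper support bounds.
    Points outside [a, b] get -inf.
    """
    x = np.atleast_2d(x)
    inside = np.all((x >= a) & (x <= b), axis=1)
    log_comp = []
    for k in range(len(weights)):
        # Per-dimension Gaussian log-density minus the log of the in-support mass.
        log_pdf = norm.logpdf(x, means[k], stds[k]).sum(axis=1)
        log_mass = np.log(norm.cdf(b, means[k], stds[k])
                          - norm.cdf(a, means[k], stds[k])).sum()
        log_comp.append(np.log(weights[k]) + log_pdf - log_mass)
    out = np.logaddexp.reduce(np.stack(log_comp, axis=0), axis=0)
    return np.where(inside, out, -np.inf)

# Toy usage on a 2-D bounded source.
weights = np.array([0.6, 0.4])
means = np.array([[0.2, 0.3], [0.7, 0.8]])
stds = np.array([[0.2, 0.2], [0.1, 0.3]])
a, b = np.zeros(2), np.ones(2)
print(bounded_gmm_logpdf(np.array([[0.25, 0.4], [1.5, 0.5]]),
                         weights, means, stds, a, b))
```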
Citations: 66
From the editor-in-chief
Pub Date : 2003-01-01 DOI: 10.1109/TSA.2003.815277
I. Trancoso
{"title":"From the editor-in-chief","authors":"I. Trancoso","doi":"10.1109/TSA.2003.815277","DOIUrl":"https://doi.org/10.1109/TSA.2003.815277","url":null,"abstract":"","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"17 1","pages":"297"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72658562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Stereophonic acoustic echo cancellation using lattice orthogonalization
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.804537
K. Mayyas
Stereophonic teleconferencing provides more natural acoustic perception by virtue of its enhanced sound localization. Of paramount importance is stereo acoustic echo cancellation (SAEC), which poses a difficult challenge for low-complexity adaptive algorithms seeking acceptable AEC, mainly because of the strong cross-correlation between the two-channel input signals. This paper proposes a transform domain two-channel lattice algorithm that inherently decorrelates the stereo signals. The algorithm, however, bears a high computational complexity for large filter orders, N. A low-complexity O(4N) algorithm is developed based on employing the functionality of the two-channel lattice cell of the previous algorithm in a weighted subband scheme. The algorithm is capable of producing complete orthogonal subbands of the stereo signals, and also allows for a tradeoff between performance and complexity. The performance of the proposed algorithms is compared with other existing algorithms via simulations and using actual teleconferencing room impulse responses.
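For orientation, the sketch below is a plain two-channel NLMS stereo echo canceller, i.e., the kind of baseline whose limitations motivate the lattice-orthogonalized algorithms of the paper; it is not the proposed algorithm, and the filter length, step size, and toy echo paths are illustrative assumptions.

```python
# Baseline sketch (not the paper's lattice algorithm): two-channel NLMS SAEC.
import numpy as np

def stereo_nlms(x_l, x_r, mic, order=128, mu=0.5, eps=1e-6):
    """Cancel the echo of the two loudspeaker signals (x_l, x_r) from mic."""
    w = np.zeros(2 * order)                 # stacked left/right echo-path filters
    err = np.zeros(len(mic))
    for n in range(order - 1, len(mic)):
        u = np.concatenate([x_l[n - order + 1:n + 1][::-1],
                            x_r[n - order + 1:n + 1][::-1]])
        y = w @ u                           # estimated echo
        e = mic[n] - y                      # residual (sent to the far end)
        w += mu * e * u / (u @ u + eps)     # normalized LMS update
        err[n] = e
    return err

# Toy usage with random loudspeaker signals and random echo paths.
rng = np.random.default_rng(2)
x_l, x_r = rng.normal(size=8000), rng.normal(size=8000)
h_l, h_r = rng.normal(size=128) * 0.1, rng.normal(size=128) * 0.1
mic = (np.convolve(x_l, h_l)[:8000] + np.convolve(x_r, h_r)[:8000]
       + 1e-3 * rng.normal(size=8000))
res = stereo_nlms(x_l, x_r, mic)
print("residual echo power (last 1000 samples):", np.mean(res[-1000:] ** 2))
```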
Citations: 18
Application of time-frequency principal component analysis to text-independent speaker identification
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.800557
I. Magrin-Chagnolleau, G. Durou, F. Bimbot
We propose a formalism, called vector filtering of spectral trajectories, that allows the integration of a number of speech parameterization approaches (cepstral analysis, Δ and ΔΔ parameterizations, auto-regressive vector modeling, ...) under a common formalism. We then propose a new filtering, called contextual principal components (CPC) or time-frequency principal components (TFPC). This filtering consists in extracting the principal components of the contextual covariance matrix, which is the covariance matrix of a sequence of vectors expanded by their context. We apply this new filtering in the framework of closed-set speaker identification, using a subset of the POLYCOST database. When using speaker-dependent TFPC filters, our results show a relative improvement of approximately 20% compared to the use of the classical cepstral coefficients augmented by their Δ-coefficients, which is significantly better with a 90% confidence level.
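The sketch below follows the description above: each frame is expanded with its temporal context, the contextual covariance matrix is eigendecomposed, and frames are projected onto the leading components. The context width, number of components, feature dimensions, and data are illustrative assumptions; the identification back end is not shown.

```python
# Minimal sketch of contextual (time-frequency) principal component filtering.
import numpy as np

def tfpc_filter(frames, context=2, n_components=24):
    """frames: (T, d) sequence of spectral/cepstral vectors."""
    T, d = frames.shape
    # Expand each frame with +/- context neighbours -> (T - 2c, (2c+1) * d).
    expanded = np.stack([frames[i - context:i + context + 1].ravel()
                         for i in range(context, T - context)])
    expanded = expanded - expanded.mean(axis=0)
    cov = np.cov(expanded, rowvar=False)          # contextual covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)
    basis = eigvec[:, ::-1][:, :n_components]     # leading principal components
    return expanded @ basis                       # filtered trajectories

rng = np.random.default_rng(3)
features = rng.normal(size=(500, 13))             # stand-in for cepstral frames
print(tfpc_filter(features).shape)                # (496, 24)
```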
Citations: 24
Improved audio coding using a psychoacoustic model based on a cochlear filter bank
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.804536
F. Baumgarte
Perceptual audio coders use an estimated masked threshold for the determination of the maximum permissible just-inaudible noise level introduced by quantization. This estimate is derived from a psychoacoustic model mimicking the properties of masking. Most psychoacoustic models for coding applications use a uniform (equal bandwidth) spectral decomposition as a first step to approximate the frequency selectivity of the human auditory system. However, the equal filter properties of the uniform subbands do not match the nonuniform characteristics of cochlear filters and reduce the precision of psychoacoustic modeling. Even so, uniform filter banks are applied because they are computationally efficient. This paper presents a psychoacoustic model based on an efficient nonuniform cochlear filter bank and a simple masked threshold estimation. The novel filter-bank structure employs cascaded low-order IIR filters and appropriate down-sampling to increase efficiency. The filter responses are optimized for the modeling of auditory masking effects. Results of the new psychoacoustic model applied to audio coding show better performance in terms of bit rate and/or quality of the new model in comparison with other state-of-the-art models using a uniform spectral decomposition. The low delay of the new model is particularly suitable for low-delay coders.
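The sketch below only illustrates the structural idea of cascading low-order IIR sections with down-sampling between stages, which is where the efficiency of a nonuniform decomposition comes from; the actual cochlear filter responses and the masked-threshold estimation of the paper are not reproduced, and the filter design and stage count are illustrative assumptions.

```python
# Structural sketch only: cascaded low-order IIR low-pass sections with
# down-sampling by two between stages, yielding progressively narrower,
# lower-rate bands (octave-like, not the paper's cochlear filter shapes).
import numpy as np
from scipy import signal

def cascaded_iir_bank(x, fs, stages=6):
    """Return a list of (band_signal, sample_rate) pairs, one per stage."""
    bands = []
    current, rate = x, fs
    for _ in range(stages):
        b, a = signal.butter(2, 0.45)          # low-order IIR, cutoff ~0.45*Nyquist
        low = signal.lfilter(b, a, current)
        high = current - low                   # crude complementary (high) band
        bands.append((high, rate))
        current, rate = low[::2], rate // 2    # down-sample the low branch
    bands.append((current, rate))              # residual low band
    return bands

rng = np.random.default_rng(4)
x = rng.normal(size=16000)                      # 1 s of noise at 16 kHz
for i, (band, rate) in enumerate(cascaded_iir_bank(x, 16000)):
    print(f"band {i}: {len(band)} samples at {rate} Hz")
```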
Citations: 36
Text-independent speaker verification using utterance level scoring and covariance modeling
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.803419
Ran D. Zilca
This paper describes a computationally simple method to perform text independent speaker verification using second order statistics. The suggested method, called utterance level scoring (ULS), allows one to obtain a normalized score using a single pass through the frames of the tested utterance. The utterance sample covariance is first calculated and then compared to the speaker covariance using a distortion measure. Subsequently, a distortion measure between the utterance covariance and the sample covariance of data taken from different speakers is used to normalize the score. Experimental results from the 2000 NIST speaker recognition evaluation are presented for ULS, used with different distortion measures, and for a Gaussian mixture model (GMM) system. The results indicate that ULS is a viable alternative to GMM whenever computational complexity and verification accuracy need to be traded off.
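A minimal sketch of utterance-level second-order scoring follows, assuming the arithmetic-harmonic sphericity measure as the covariance distortion; the paper's specific distortion measures and normalization may differ, and the feature dimension, covariances, and data are illustrative.

```python
# Minimal sketch: single-pass utterance covariance compared with stored
# target and background covariances via a covariance distortion measure.
import numpy as np

def ahs_distortion(c1, c2):
    """Arithmetic-harmonic sphericity distortion between covariance matrices
    (zero when c1 == c2, positive otherwise)."""
    d = c1.shape[0]
    return np.log(np.trace(c1 @ np.linalg.inv(c2))
                  * np.trace(c2 @ np.linalg.inv(c1)) / (d * d))

def uls_score(utt_frames, target_cov, background_cov):
    """Higher score means the utterance covariance is closer to the target."""
    utt_cov = np.cov(utt_frames, rowvar=False)   # single pass over the frames
    return (ahs_distortion(utt_cov, background_cov)
            - ahs_distortion(utt_cov, target_cov))

rng = np.random.default_rng(5)
target_cov = np.diag(rng.uniform(0.5, 2.0, size=13))
background_cov = np.eye(13)
utt = rng.normal(size=(400, 13)) @ np.linalg.cholesky(target_cov).T
print(f"normalized ULS score: {uls_score(utt, target_cov, background_cov):.3f}")
```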
Citations: 14
Perceptual audio coding using adaptive pre- and post-filters and lossless compression
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.803444
G. Schuller, Bin Yu, Dawei Huang, B. Edler
This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts with traditional audio coding, where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, which is termed weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a database of moderate size which contains mono-signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
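As a simplified stand-in for the predictive lossless (redundancy reduction) stage, the sketch below uses a single NLMS predictor with integer residuals, so the decoder can reconstruct the samples exactly; the paper's WCLMS scheme cascades and weights several predictors and adds the psychoacoustic pre-filter, none of which is reproduced, and the predictor order, step size, and toy signal are illustrative assumptions.

```python
# Simplified stand-in for adaptive predictive lossless coding: an NLMS
# predictor forms an integer prediction of each sample and only the integer
# residual is transmitted; the decoder runs the same predictor and
# reconstructs the signal exactly.
import numpy as np

ORDER, MU, EPS = 16, 0.4, 1e-6

def lossless_encode(samples):
    w = np.zeros(ORDER)
    history = np.zeros(ORDER)
    residuals = np.empty_like(samples)
    for n, s in enumerate(samples):
        pred = int(round(w @ history))
        residuals[n] = s - pred                        # transmit this residual
        w += MU * (s - pred) * history / (history @ history + EPS)
        history = np.roll(history, 1); history[0] = s  # update with the true sample
    return residuals

def lossless_decode(residuals):
    w = np.zeros(ORDER)
    history = np.zeros(ORDER)
    samples = np.empty_like(residuals)
    for n, r in enumerate(residuals):
        pred = int(round(w @ history))
        s = r + pred                                   # exact reconstruction
        samples[n] = s
        w += MU * (s - pred) * history / (history @ history + EPS)
        history = np.roll(history, 1); history[0] = s
    return samples

rng = np.random.default_rng(6)
pcm = np.cumsum(rng.integers(-50, 50, size=4000))      # toy integer "audio" signal
res = lossless_encode(pcm)
assert np.array_equal(lossless_decode(res), pcm)
print("mean |residual| vs. mean |sample|:",
      np.mean(np.abs(res)), np.mean(np.abs(pcm)))
```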
Citations: 75
Robust speech recognition using probabilistic union models
Pub Date : 2002-12-10 DOI: 10.1109/TSA.2002.803439
J. Ming, P. Jančovič, F. J. Smith
This paper introduces a new statistical approach, namely the probabilistic union model, for speech recognition involving partial, unknown frequency-band corruption. Partial frequency-band corruption accounts for the effect of a family of real-world noises. Previous methods based on the missing feature theory usually require the identity of the noisy bands. This identification can be difficult for unexpected noise with unknown, time-varying band characteristics. The new model combines the local frequency-band information based on the union of random events, to reduce the dependence of the model on information about the noise. This model partially accomplishes the target: offering robustness to partial frequency-band corruption, while requiring no information about the noise. This paper introduces the theory and implementation of the union model, and is focused on several important advances. These new developments include a new algorithm for automatic order selection, a generalization of the modeling principle to accommodate partial feature stream corruption, and a combination of the union model with conventional noise reduction techniques to deal with a mixture of stationary noise and unknown, nonstationary noise. For the evaluation, we used the TIDIGITS database for speaker-independent connected digit recognition. The utterances were corrupted by various types of additive noise, stationary or time-varying, assuming no knowledge about the noise characteristics. The results indicate that the new model offers significantly improved robustness in comparison to other models.
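The sketch below gives a schematic reading of union-style combination: an order-m score sums the products of per-band likelihoods over all subsets that leave out m bands, so a few badly corrupted bands cannot drive the overall likelihood toward zero. The paper's exact formulation, normalization, and automatic order selection may differ; the band likelihood values here are illustrative.

```python
# Schematic sketch of union-style combination of per-band likelihoods.
import numpy as np
from itertools import combinations

def union_score(band_likelihoods, order):
    """band_likelihoods: per-band likelihoods p(o_i | model);
    order: number of bands assumed corrupted (left out of each product)."""
    n = len(band_likelihoods)
    subsets = combinations(range(n), n - order)
    return sum(np.prod([band_likelihoods[i] for i in s]) for s in subsets)

# Toy example: band 4 is badly corrupted (near-zero likelihood under the
# correct model). A plain product collapses; the order-1 union score does not.
clean = np.array([0.8, 0.7, 0.9, 0.75])
corrupted = np.array([0.8, 0.7, 0.9, 1e-6])
print("product, clean vs corrupted:", np.prod(clean), np.prod(corrupted))
print("order-1 union, clean vs corrupted:",
      union_score(clean, 1), union_score(corrupted, 1))
```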
Citations: 46