
Latest articles in IEEE Trans. Speech Audio Process.

Window optimization in linear prediction analysis
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818213
W. Chu
The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose a gradient-descent-based approach to optimizing this window. The optimized window is shown to enhance the performance of LP-based speech coding algorithms; in most instances, the improvement comes at no additional computational cost, since it merely requires replacing the window.
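To make the baseline concrete: the autocorrelation method that the paper optimizes over can be sketched as below. This is a minimal numpy illustration with a standard Hamming window, not the paper's optimized window; the function name and interface are ours.

```python
import numpy as np

def lp_autocorrelation(frame, order, window=None):
    """Autocorrelation-method LP analysis: window the frame, compute
    autocorrelations r[0..order], then solve by Levinson-Durbin.
    Returns the prediction polynomial a (a[0] = 1) and residual energy."""
    w = np.hamming(len(frame)) if window is None else window
    x = frame * w
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i-1:0:-1]  # symmetric coefficient update
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

Because the windowed autocorrelation sequence is positive semidefinite, the resulting synthesis filter is minimum phase (all roots of the prediction polynomial inside the unit circle), which is the property that makes this method attractive for speech coding regardless of the window chosen.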
{"title":"Window optimization in linear prediction analysis","authors":"W. Chu","doi":"10.1109/TSA.2003.818213","DOIUrl":"https://doi.org/10.1109/TSA.2003.818213","url":null,"abstract":"The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose an approach to optimize the window which is based on gradient-descent. It is shown that the optimized window can enhance the performance of LP-based speech coding algorithms; in most instances, improvement in performance comes at no additional computational cost, since it merely requires a window replacement.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"24 1","pages":"626-635"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87186278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Speech enhancement using 2-D Fourier transform
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.816063
I. Soon, S. Koh
This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several filtering techniques in the 2-D Fourier transform domain are also proposed, including magnitude spectral subtraction, 2-D Wiener filtering, and a hybrid filter that effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably with the other techniques in an objective test.
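Not from the paper, but the magnitude spectral subtraction variant described above can be sketched in a few lines once the speech has been blocked into a 2-D array of frames; the noise magnitude estimate `noise_mag` is assumed given:

```python
import numpy as np

def enhance_2d(block, noise_mag, alpha=1.0):
    """2-D magnitude spectral subtraction: subtract an (over)estimate of
    the noise magnitude in the 2-D Fourier domain, keep the noisy phase."""
    S = np.fft.fft2(block)                                # block: frames x samples
    mag = np.maximum(np.abs(S) - alpha * noise_mag, 0.0)  # floor at zero
    return np.real(np.fft.ifft2(mag * np.exp(1j * np.angle(S))))
```

With `noise_mag = 0` the transform pair is an identity, which is a quick sanity check that the blocking/phase handling is lossless.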
{"title":"Speech enhancement using 2-D Fourier transform","authors":"I. Soon, S. Koh","doi":"10.1109/TSA.2003.816063","DOIUrl":"https://doi.org/10.1109/TSA.2003.816063","url":null,"abstract":"This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several techniques of filtering in the 2-D Fourier transform domain are also proposed. They include magnitude spectral subtraction, 2-D Wiener filtering as well as a hybrid filter which effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably against other techniques using an objective test.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"42 1","pages":"717-724"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90668042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818034
Xianxian Zhang, J. Hansen
While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving-car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance with a comparison to classic delay-and-sum beamforming (DASB) in realistic car conditions using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. This method is evaluated and shown to simultaneously decrease the word error rate (WER) for speech recognition by up to 31% and improve speech quality, as measured by segmental SNR (SEGSNR), by up to +5.5 dB on average.
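For reference, the delay-and-sum beamformer (DASB) used as the comparison baseline reduces, for integer-sample steering delays, to aligning and averaging the channels. A toy sketch, with circular shifts standing in for true fractional-delay steering filters:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its steering delay (integer samples,
    circular shift for simplicity) and average across channels."""
    aligned = [np.roll(np.asarray(ch, float), -d)
               for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

When the delays match the true propagation delays, the desired source adds coherently while uncorrelated noise averages down, which is the behavior the constrained adaptive stages of CSA-BF improve upon.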
{"title":"CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments","authors":"Xianxian Zhang, J. Hansen","doi":"10.1109/TSA.2003.818034","DOIUrl":"https://doi.org/10.1109/TSA.2003.818034","url":null,"abstract":"While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance with a comparison to classic delay-and-sum beamforming (DASB) in realistic car conditions using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. This method is evaluated and shown to simultaneously decrease word-error-rate (WER) for speech recognition by up to 31% and improve speech quality via the SEGSNR measure by up to +5.5 dB on the average.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"31 1","pages":"733-745"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85264102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Approximately independent factors of speech using nonlinear symplectic transformation
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.814457
M. Omar, M. Hasegawa-Johnson
This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features that can be used in applications like speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation both directly to the speech time-domain data and to the Mel-frequency cepstrum coefficients (MFCC). We also discuss experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulting from this transformation are used in phoneme recognition experiments. The best results achieved show about a 2% improvement in recognition accuracy compared to results based on MFCC features.
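The defining property being exploited above is that a symplectic map preserves the canonical skew form J, hence has unit determinant, which keeps the likelihood of the transformed features tractable (no Jacobian penalty). This is easy to check numerically; the block construction diag(A, A^{-T}) below is a textbook way to build a linear symplectic map, not the paper's transformation:

```python
import numpy as np

def is_symplectic(M, tol=1e-9):
    """Check M^T J M = J for the canonical skew form J."""
    n = M.shape[0] // 2
    J = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-np.eye(n), np.zeros((n, n))]])
    return bool(np.allclose(M.T @ J @ M, J, atol=tol))

# Example: diag(A, A^{-T}) is symplectic for any invertible A.
A = np.array([[2.0, 1.0], [1.0, 1.0]])
M = np.block([[A, np.zeros((2, 2))],
              [np.zeros((2, 2)), np.linalg.inv(A).T]])
```

Since |det M| = 1 for any symplectic M, the log-likelihood of the transformed data differs from that of the original data by no volume term, which is why maximizing feature likelihood under this constraint is well posed.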
{"title":"Approximately independent factors of speech using nonlinear symplectic transformation","authors":"M. Omar, M. Hasegawa-Johnson","doi":"10.1109/TSA.2003.814457","DOIUrl":"https://doi.org/10.1109/TSA.2003.814457","url":null,"abstract":"This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features that can be used in applications like speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation to the speech time-domain data directly and to the Mel-frequency cepstrum coefficients (MFCC). We discuss also experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulted from this transformation are used in phoneme recognition experiments. The best results achieved show about 2% improvement in recognition accuracy compared to results based on MFCC features.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"26 1","pages":"660-671"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80999206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Nonuniform oversampled filter banks for audio signal processing
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.814412
Z. Cvetković, J. Johnston
In emerging audio technology applications, there is a need to decompose audio signals into oversampled subband components with a time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each subband independently, rather than relying on aliasing-cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks which implement signal decompositions of this kind.
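For intuition only, here is a crude way to get a nonuniform (octave-like) decomposition with exact reconstruction: FFT bin masking with complementary masks. This is not the paper's near-perfect-reconstruction filter bank design, just a sketch of what "nonuniform bands that sum back to the signal" means:

```python
import numpy as np

def octave_band_decompose(x, n_bands):
    """Split x into n_bands nonuniform frequency bands by masking rfft
    bins with complementary masks; the bands sum back to x exactly.
    Band edges halve toward DC, giving octave-like (nonuniform) widths."""
    X = np.fft.rfft(x)
    edges = ([0]
             + [len(X) // 2 ** (n_bands - b) for b in range(1, n_bands)]
             + [len(X)])
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = np.zeros_like(X)
        mask[lo:hi] = X[lo:hi]
        bands.append(np.fft.irfft(mask, n=len(x)))
    return bands
```

Brick-wall masks like these have poor time localization; the paper's contribution is precisely designing bank filters with good localization and high per-band aliasing attenuation while keeping reconstruction nearly perfect.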
{"title":"Nonuniform oversampled filter banks for audio signal processing","authors":"Z. Cvetković, J. Johnston","doi":"10.1109/TSA.2003.814412","DOIUrl":"https://doi.org/10.1109/TSA.2003.814412","url":null,"abstract":"In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with time-frequency resolution which mimics that of the cochlear filter bank and with high aliasing attenuation in each of the subbands independently, rather than aliasing cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks which implement signal decompositions of this kind.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"110 1","pages":"393-399"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74757716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Matching pursuits sinusoidal speech coding
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815520
Ç. Etemoglu, V. Cuperman
This paper introduces a sinusoidal modeling technique for low bit rate speech coding wherein the parameters for each sinusoidal component are sequentially extracted by a closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of sinusoids is restricted to sets of frequency intervals or bins, which in conjunction with the closed-loop analysis allow us to map the frequencies of the sinusoids into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one represents the harmonically related and the other the nonharmonically related components of the voiced segment. This approach eliminates the need for a voicing-dependent cutoff frequency, which is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency domain information in transition regions. Our use of FBVQ with closed-loop searching contributes to improved speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical band concept of auditory perception in defining the frequency bins. To demonstrate the viability and the advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder.
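The matching-pursuit framework referenced above, specialized to a dictionary of sinusoids at a fixed set of candidate frequencies, can be sketched as follows. This is a generic greedy pursuit on the residual; the paper's frequency-bin structure and quantization are omitted:

```python
import numpy as np

def matching_pursuit_sines(target, freqs, n_atoms):
    """Greedy matching pursuit: at each step pick the unit-norm cosine or
    sine atom (normalized frequency f in (0, 0.5)) most correlated with
    the residual, subtract its projection, and add it to the model."""
    t = np.arange(len(target))
    r = np.asarray(target, float).copy()
    model = np.zeros(len(target))
    for _ in range(n_atoms):
        best_c, best_a = 0.0, None
        for f in freqs:
            for raw in (np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)):
                a = raw / np.linalg.norm(raw)   # unit-norm atom
                c = np.dot(r, a)                # correlation with residual
                if best_a is None or abs(c) > abs(best_c):
                    best_c, best_a = c, a
        r = r - best_c * best_a
        model = model + best_c * best_a
    return model, r
```

The closed-loop ("analysis-by-synthesis") flavor in the paper corresponds to selecting each atom against the residual of everything chosen so far, exactly as the loop above does.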
{"title":"Matching pursuits sinusoidal speech coding","authors":"Ç. Etemoglu, V. Cuperman","doi":"10.1109/TSA.2003.815520","DOIUrl":"https://doi.org/10.1109/TSA.2003.815520","url":null,"abstract":"This paper introduces a sinusoidal modeling technique for low bit rate speech coding wherein the parameters for each sinusoidal component are sequentially extracted by a closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of sinusoids is restricted to sets of frequency intervals or bins, which in conjunction with the closed-loop analysis allow us to map the frequencies of the sinusoids into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one of them represents harmonically related and the other one nonharmonically related components of the voiced segment. This approach eliminates the need for voicing dependent cutoff frequency that is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency domain information in transition regions. Our use of FBVQ with closed-loop searching contribute to an improvement of speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical band concept of auditory perception in defining the frequency bins. To demonstrate the viability and the advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder. 
Subjective results indicate that the proposed coder at 4 kbps has quality exceeding the 6.3 kbps G.723.1 coder.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"6 1","pages":"413-424"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87930146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.811544
I. Cohen
Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Smoothing in the second iteration then excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.
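A heavily simplified, illustrative version of the minima-controlled idea (recursive smoothing, sliding-window minimum tracking, and a presence flag that slows the noise update during speech) might look like this; the actual IMCRA presence probability, two-iteration smoothing, and bias compensation are considerably more elaborate:

```python
import numpy as np

def track_noise_psd(psd_frames, alpha_s=0.8, alpha_d=0.95, gamma=4.0, win=10):
    """Per-bin noise PSD tracking: recursively smooth the periodogram,
    track its minimum over the last `win` frames, flag speech presence
    where the smoothed power well exceeds that minimum, and slow the
    noise update accordingly."""
    S = psd_frames[0].astype(float).copy()       # smoothed periodogram
    noise = psd_frames[0].astype(float).copy()   # noise PSD estimate
    history = [S.copy()]
    for frame in psd_frames[1:]:
        S = alpha_s * S + (1 - alpha_s) * frame
        history.append(S.copy())
        S_min = np.min(history[-win:], axis=0)
        presence = (S > gamma * S_min).astype(float)  # crude speech flag
        a = alpha_d + (1 - alpha_d) * presence        # presence=1 freezes update
        noise = a * noise + (1 - a) * frame
    return noise
```

The key behavior, captured even by this toy version, is that bursts of speech power raise the smoothed periodogram far above the tracked minimum, so the noise estimate is effectively frozen during speech and keeps following the noise floor otherwise.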
{"title":"Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging","authors":"I. Cohen","doi":"10.1109/TSA.2003.811544","DOIUrl":"https://doi.org/10.1109/TSA.2003.811544","url":null,"abstract":"Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach, for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"3 2 1","pages":"466-475"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78286955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 949
Quantization of LSF parameters using a trellis modeling
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.814411
F. Lahouti, A. Khandani
An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on modeling the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range are explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed, and an efficient algorithm for index generation (finding the index of a path in the trellis) is presented. The sequential vector decorrelation technique is also presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons to the split-VQ, the trellis coded quantization of LSF parameters, and the multi-stage VQ, as well as to the interframe scheme used in the IS-641 EFRC and the GSM AMR codec, are provided. These results demonstrate that the proposed BTQ schemes outperform the above systems.
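The ordering and boundedness constraints mentioned above (0 < ω1 < ... < ωp < π) can be enforced after quantization by a simple projection. A sketch of such a stabilization step; this is a generic repair pass, not the trellis model itself, and it assumes len(lsf) * min_gap is well below π:

```python
import numpy as np

def enforce_lsf_constraints(lsf, min_gap=0.01):
    """Project an LSF vector onto the feasible set: values in (0, pi),
    strictly increasing with at least min_gap separation."""
    out = np.sort(np.clip(np.asarray(lsf, float), min_gap, np.pi - min_gap))
    for i in range(1, len(out)):            # forward pass: push overlaps up
        out[i] = max(out[i], out[i - 1] + min_gap)
    out[-1] = min(out[-1], np.pi - min_gap)
    for i in range(len(out) - 2, -1, -1):   # backward pass: pull down from pi
        out[i] = min(out[i], out[i + 1] - min_gap)
    return out
```

The BTQ scheme avoids the need for such after-the-fact repair by building the ordering and range bounds directly into the trellis, so every decodable path is a valid LSF vector by construction.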
{"title":"Quantization of LSF parameters using a trellis modeling","authors":"F. Lahouti, A. Khandani","doi":"10.1109/TSA.2003.814411","DOIUrl":"https://doi.org/10.1109/TSA.2003.814411","url":null,"abstract":"An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on the modeling of the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range is explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed and an efficient algorithm for the index generation (finding the index of a path in the trellis) is presented. Also the sequential vector decorrelation technique is presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons to the split-VQ, the trellis coded quantization of LSF parameters, and the multi-stage VQ, as well as the interframe scheme used in IS-641 EFRC and the GSM AMR codec are provided. These results demonstrate that the proposed BTQ schemes outperform the above systems.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"54 1","pages":"400-412"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86595832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Efficient text-independent speaker verification with structural Gaussian mixture models and neural network
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815822
Bing Xiang, T. Berger
We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network to achieve both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of Gaussian mixture components is scored for each feature vector, reducing the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that the computation can be reduced by a factor of 17 with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
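The computational saving comes from scoring only the Gaussians under the best coarse cluster. A toy two-stage scorer with uniform weights and shared spherical variance (a drastic simplification of the SGMM, with names of our choosing) illustrates that the fast score tracks the full score when clusters are well separated:

```python
import numpy as np

def gmm_loglik(x, means, var=1.0):
    """Full log-likelihood of x under a GMM with uniform weights and a
    shared spherical variance."""
    d2 = np.sum((means - x) ** 2, axis=1)
    ll = -0.5 * d2 / var - 0.5 * len(x) * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(ll) - np.log(len(means))

def structured_loglik(x, centroids, groups, means, var=1.0):
    """Two-stage scoring: pick the closest coarse cluster, then evaluate
    only its member Gaussians (others contribute negligibly)."""
    g = int(np.argmin([np.sum((x - c) ** 2) for c in centroids]))
    member = means[groups[g]]
    d2 = np.sum((member - x) ** 2, axis=1)
    ll = -0.5 * d2 / var - 0.5 * len(x) * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(ll) - np.log(len(means))
```

With M components split into k clusters, the per-frame cost drops from O(M) Gaussian evaluations to roughly O(k + M/k), which is the source of the factor-of-17 speedup reported above.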
{"title":"Efficient text-independent speaker verification with structural Gaussian mixture models and neural network","authors":"Bing Xiang, T. Berger","doi":"10.1109/TSA.2003.815822","DOIUrl":"https://doi.org/10.1109/TSA.2003.815822","url":null,"abstract":"We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for purposes of achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions in different levels of resolution. For each target speaker, a SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During test, only a small subset of Gaussian mixture components are scored for each feature vector in order to reduce the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for final decision. Different configurations are compared in the experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"43 1","pages":"447-456"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80338680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 120
Blind single channel deconvolution using nonstationary signal processing
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815522
J. Hopgood, P. Rayner
Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or the channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The parameters for the channel are estimated by modeling the source as a time-varying AR process and the distortion by an all-pole filter, and by using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, which merely relies on the fact that the channel is actually stationary rather than modeling it as such, the proposed Bayesian method does account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it as a curse is discussed.
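Since the channel is modeled as an all-pole filter 1/A(z), deconvolution with a known (or estimated) A(z) reduces to applying the FIR filter A(z) to the observation; the hard part of the paper is estimating A(z) blindly. A minimal sketch of this forward/inverse pair (the blind Bayesian estimation itself is not shown):

```python
import numpy as np

def allpole_channel(x, a):
    """Pass x through the all-pole channel 1/A(z), where
    a = [1, a1, ..., ap] are the denominator coefficients."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def invert_channel(y, a):
    """Deconvolve by applying the FIR filter A(z) to the observation."""
    return np.convolve(y, a)[:len(y)]
```

This exact invertibility is why estimating the channel poles is equivalent to solving the deconvolution problem: once A(z) is identified, the source is recovered by a short FIR filtering operation.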
{"title":"Blind single channel deconvolution using nonstationary signal processing","authors":"J. Hopgood, P. Rayner","doi":"10.1109/TSA.2003.815522","DOIUrl":"https://doi.org/10.1109/TSA.2003.815522","url":null,"abstract":"Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The parameters for the channel are estimated by modeling the source as a time-varyng AR process and the distortion by an all-pole filter, and using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, where the technique merely relies on the fact that the channel is actually stationary rather than modeling it as so, the proposed Bayesian method does take account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it as a curse is discussed.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"469 1","pages":"476-488"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77508201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58