
Latest articles in IEEE Trans. Speech Audio Process.

Window optimization in linear prediction analysis
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818213
W. Chu
The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose a gradient-descent-based approach to optimizing this window. The optimized window is shown to enhance the performance of LP-based speech coding algorithms; in most instances, the improvement comes at no additional computational cost, since it merely requires replacing the window.
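To make the baseline concrete: the autocorrelation method that the paper optimizes over can be sketched as below. This is a minimal numpy illustration with a standard Hamming window, not the paper's optimized window; the function name and interface are ours.

```python
import numpy as np

def lp_autocorrelation(frame, order, window=None):
    """Autocorrelation-method LP analysis: window the frame, compute
    autocorrelations r[0..order], then solve by Levinson-Durbin.
    Returns the prediction polynomial a (a[0] = 1) and residual energy."""
    w = np.hamming(len(frame)) if window is None else window
    x = frame * w
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i-1:0:-1]  # symmetric coefficient update
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

Because the windowed autocorrelation sequence is positive semidefinite, the resulting synthesis filter is minimum phase (all roots of the prediction polynomial inside the unit circle), which is the property that makes this method attractive for speech coding regardless of the window chosen.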
{"title":"Window optimization in linear prediction analysis","authors":"W. Chu","doi":"10.1109/TSA.2003.818213","DOIUrl":"https://doi.org/10.1109/TSA.2003.818213","url":null,"abstract":"The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose an approach to optimize the window which is based on gradient-descent. It is shown that the optimized window can enhance the performance of LP-based speech coding algorithms; in most instances, improvement in performance comes at no additional computational cost, since it merely requires a window replacement.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"24 1","pages":"626-635"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87186278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Speech enhancement using 2-D Fourier transform
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.816063
I. Soon, S. Koh
This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several filtering techniques in the 2-D Fourier transform domain are also proposed, including magnitude spectral subtraction, 2-D Wiener filtering, and a hybrid filter that effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably with the other techniques in an objective test.
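Not from the paper, but the magnitude spectral subtraction variant described above can be sketched in a few lines once the speech has been blocked into a 2-D array of frames; the noise magnitude estimate `noise_mag` is assumed given:

```python
import numpy as np

def enhance_2d(block, noise_mag, alpha=1.0):
    """2-D magnitude spectral subtraction: subtract an (over)estimate of
    the noise magnitude in the 2-D Fourier domain, keep the noisy phase."""
    S = np.fft.fft2(block)                                # block: frames x samples
    mag = np.maximum(np.abs(S) - alpha * noise_mag, 0.0)  # floor at zero
    return np.real(np.fft.ifft2(mag * np.exp(1j * np.angle(S))))
```

With `noise_mag = 0` the transform pair is an identity, which is a quick sanity check that the blocking/phase handling is lossless.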
{"title":"Speech enhancement using 2-D Fourier transform","authors":"I. Soon, S. Koh","doi":"10.1109/TSA.2003.816063","DOIUrl":"https://doi.org/10.1109/TSA.2003.816063","url":null,"abstract":"This paper presents an innovative way of using the two-dimensional (2-D) Fourier transform for speech enhancement. The blocking and windowing of the speech data for the 2-D Fourier transform are explained in detail. Several techniques of filtering in the 2-D Fourier transform domain are also proposed. They include magnitude spectral subtraction, 2-D Wiener filtering as well as a hybrid filter which effectively combines the one-dimensional (1-D) Wiener filter with the 2-D Wiener filter. The proposed hybrid filter compares favorably against other techniques using an objective test.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"42 1","pages":"717-724"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90668042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.818034
Xianxian Zhang, J. Hansen
While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving-car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance with a comparison to classic delay-and-sum beamforming (DASB) in realistic car conditions using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. This method is evaluated and shown to simultaneously decrease the word error rate (WER) for speech recognition by up to 31% and improve speech quality, as measured by segmental SNR (SEGSNR), by up to +5.5 dB on average.
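For reference, the delay-and-sum beamformer (DASB) used as the comparison baseline reduces, for integer-sample steering delays, to aligning and averaging the channels. A toy sketch, with circular shifts standing in for true fractional-delay steering filters:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its steering delay (integer samples,
    circular shift for simplicity) and average across channels."""
    aligned = [np.roll(np.asarray(ch, float), -d)
               for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

When the delays match the true propagation delays, the desired source adds coherently while uncorrelated noise averages down, which is the behavior the constrained adaptive stages of CSA-BF improve upon.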
{"title":"CSA-BF: a constrained switched adaptive beamformer for speech enhancement and recognition in real car environments","authors":"Xianxian Zhang, J. Hansen","doi":"10.1109/TSA.2003.818034","DOIUrl":"https://doi.org/10.1109/TSA.2003.818034","url":null,"abstract":"While a number of studies have investigated various speech enhancement and processing schemes for in-vehicle speech systems, little research has been performed using actual voice data collected in noisy car environments. In this paper, we propose a new constrained switched adaptive beamforming algorithm (CSA-BF) for speech enhancement and recognition in real moving car environments. The proposed algorithm consists of a speech/noise constraint section, a speech adaptive beamformer, and a noise adaptive beamformer. We investigate CSA-BF performance with a comparison to classic delay-and-sum beamforming (DASB) in realistic car conditions using a corpus of data recorded in various car noise environments from across the U.S. After analyzing the experimental results and considering the range of complex noise situations in the car environment using the CU-Move corpus, we formulate the three specific processing stages of the CSA-BF algorithm. This method is evaluated and shown to simultaneously decrease word-error-rate (WER) for speech recognition by up to 31% and improve speech quality via the SEGSNR measure by up to +5.5 dB on the average.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"31 1","pages":"733-745"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85264102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Approximately independent factors of speech using nonlinear symplectic transformation
Pub Date : 2003-11-01 DOI: 10.1109/TSA.2003.814457
M. Omar, M. Hasegawa-Johnson
This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features that can be used in applications like speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation both directly to the speech time-domain data and to the Mel-frequency cepstrum coefficients (MFCC). We also discuss experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulting from this transformation are used in phoneme recognition experiments. The best results achieved show about a 2% improvement in recognition accuracy compared to results based on MFCC features.
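The defining property being exploited above is that a symplectic map preserves the canonical skew form J, hence has unit determinant, which keeps the likelihood of the transformed features tractable (no Jacobian penalty). This is easy to check numerically; the block construction diag(A, A^{-T}) below is a textbook way to build a linear symplectic map, not the paper's transformation:

```python
import numpy as np

def is_symplectic(M, tol=1e-9):
    """Check M^T J M = J for the canonical skew form J."""
    n = M.shape[0] // 2
    J = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-np.eye(n), np.zeros((n, n))]])
    return bool(np.allclose(M.T @ J @ M, J, atol=tol))

# Example: diag(A, A^{-T}) is symplectic for any invertible A.
A = np.array([[2.0, 1.0], [1.0, 1.0]])
M = np.block([[A, np.zeros((2, 2))],
              [np.zeros((2, 2)), np.linalg.inv(A).T]])
```

Since |det M| = 1 for any symplectic M, the log-likelihood of the transformed data differs from that of the original data by no volume term, which is why maximizing feature likelihood under this constraint is well posed.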
{"title":"Approximately independent factors of speech using nonlinear symplectic transformation","authors":"M. Omar, M. Hasegawa-Johnson","doi":"10.1109/TSA.2003.814457","DOIUrl":"https://doi.org/10.1109/TSA.2003.814457","url":null,"abstract":"This paper addresses the problem of representing the speech signal using a set of features that are approximately statistically independent. This statistical independence simplifies building probabilistic models based on these features that can be used in applications like speech recognition. Since there is no evidence that the speech signal is a linear combination of separate factors or sources, we use a more general nonlinear transformation of the speech signal to achieve our approximately statistically independent feature set. We choose the transformation to be symplectic to maximize the likelihood of the generated feature set. In this paper, we describe applying this nonlinear transformation to the speech time-domain data directly and to the Mel-frequency cepstrum coefficients (MFCC). We discuss also experiments in which the generated feature set is transformed into a more compact set using a maximum mutual information linear transformation. This linear transformation is used to generate the acoustic features that represent the distinctions among the phonemes. The features resulted from this transformation are used in phoneme recognition experiments. The best results achieved show about 2% improvement in recognition accuracy compared to results based on MFCC features.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"26 1","pages":"660-671"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80999206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Nonuniform oversampled filter banks for audio signal processing
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.814412
Z. Cvetković, J. Johnston
In emerging audio technology applications, there is a need to decompose audio signals into oversampled subband components with a time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each subband independently, rather than relying on aliasing-cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks which implement signal decompositions of this kind.
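For intuition only, here is a crude way to get a nonuniform (octave-like) decomposition with exact reconstruction: FFT bin masking with complementary masks. This is not the paper's near-perfect-reconstruction filter bank design, just a sketch of what "nonuniform bands that sum back to the signal" means:

```python
import numpy as np

def octave_band_decompose(x, n_bands):
    """Split x into n_bands nonuniform frequency bands by masking rfft
    bins with complementary masks; the bands sum back to x exactly.
    Band edges halve toward DC, giving octave-like (nonuniform) widths."""
    X = np.fft.rfft(x)
    edges = ([0]
             + [len(X) // 2 ** (n_bands - b) for b in range(1, n_bands)]
             + [len(X)])
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = np.zeros_like(X)
        mask[lo:hi] = X[lo:hi]
        bands.append(np.fft.irfft(mask, n=len(x)))
    return bands
```

Brick-wall masks like these have poor time localization; the paper's contribution is precisely designing bank filters with good localization and high per-band aliasing attenuation while keeping reconstruction nearly perfect.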
{"title":"Nonuniform oversampled filter banks for audio signal processing","authors":"Z. Cvetković, J. Johnston","doi":"10.1109/TSA.2003.814412","DOIUrl":"https://doi.org/10.1109/TSA.2003.814412","url":null,"abstract":"In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with time-frequency resolution which mimics that of the cochlear filter bank and with high aliasing attenuation in each of the subbands independently, rather than aliasing cancellation properties. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks which implement signal decompositions of this kind.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"110 1","pages":"393-399"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74757716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Matching pursuits sinusoidal speech coding
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815520
Ç. Etemoglu, V. Cuperman
This paper introduces a sinusoidal modeling technique for low bit rate speech coding wherein the parameters for each sinusoidal component are sequentially extracted by a closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of sinusoids is restricted to sets of frequency intervals or bins, which in conjunction with the closed-loop analysis allow us to map the frequencies of the sinusoids into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one represents the harmonically related and the other the nonharmonically related components of the voiced segment. This approach eliminates the need for a voicing-dependent cutoff frequency, which is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency domain information in transition regions. Our use of FBVQ with closed-loop searching contributes to improved speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical band concept of auditory perception in defining the frequency bins. To demonstrate the viability and the advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder.
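The matching-pursuit framework referenced above, specialized to a dictionary of sinusoids at a fixed set of candidate frequencies, can be sketched as follows. This is a generic greedy pursuit on the residual; the paper's frequency-bin structure and quantization are omitted:

```python
import numpy as np

def matching_pursuit_sines(target, freqs, n_atoms):
    """Greedy matching pursuit: at each step pick the unit-norm cosine or
    sine atom (normalized frequency f in (0, 0.5)) most correlated with
    the residual, subtract its projection, and add it to the model."""
    t = np.arange(len(target))
    r = np.asarray(target, float).copy()
    model = np.zeros(len(target))
    for _ in range(n_atoms):
        best_c, best_a = 0.0, None
        for f in freqs:
            for raw in (np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)):
                a = raw / np.linalg.norm(raw)   # unit-norm atom
                c = np.dot(r, a)                # correlation with residual
                if best_a is None or abs(c) > abs(best_c):
                    best_c, best_a = c, a
        r = r - best_c * best_a
        model = model + best_c * best_a
    return model, r
```

The closed-loop ("analysis-by-synthesis") flavor in the paper corresponds to selecting each atom against the residual of everything chosen so far, exactly as the loop above does.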
{"title":"Matching pursuits sinusoidal speech coding","authors":"Ç. Etemoglu, V. Cuperman","doi":"10.1109/TSA.2003.815520","DOIUrl":"https://doi.org/10.1109/TSA.2003.815520","url":null,"abstract":"This paper introduces a sinusoidal modeling technique for low bit rate speech coding wherein the parameters for each sinusoidal component are sequentially extracted by a closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of sinusoids is restricted to sets of frequency intervals or bins, which in conjunction with the closed-loop analysis allow us to map the frequencies of the sinusoids into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one of them represents harmonically related and the other one nonharmonically related components of the voiced segment. This approach eliminates the need for voicing dependent cutoff frequency that is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook in order to represent the frequency domain information in transition regions. Our use of FBVQ with closed-loop searching contribute to an improvement of speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical band concept of auditory perception in defining the frequency bins. To demonstrate the viability and the advantages of the new models studied, we designed a 4 kbps matching pursuits sinusoidal speech coder. 
Subjective results indicate that the proposed coder at 4 kbps has quality exceeding the 6.3 kbps G.723.1 coder.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"6 1","pages":"413-424"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87930146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.811544
I. Cohen
Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Smoothing in the second iteration then excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.
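A heavily simplified, illustrative version of the minima-controlled idea (recursive smoothing, sliding-window minimum tracking, and a presence flag that slows the noise update during speech) might look like this; the actual IMCRA presence probability, two-iteration smoothing, and bias compensation are considerably more elaborate:

```python
import numpy as np

def track_noise_psd(psd_frames, alpha_s=0.8, alpha_d=0.95, gamma=4.0, win=10):
    """Per-bin noise PSD tracking: recursively smooth the periodogram,
    track its minimum over the last `win` frames, flag speech presence
    where the smoothed power well exceeds that minimum, and slow the
    noise update accordingly."""
    S = psd_frames[0].astype(float).copy()       # smoothed periodogram
    noise = psd_frames[0].astype(float).copy()   # noise PSD estimate
    history = [S.copy()]
    for frame in psd_frames[1:]:
        S = alpha_s * S + (1 - alpha_s) * frame
        history.append(S.copy())
        S_min = np.min(history[-win:], axis=0)
        presence = (S > gamma * S_min).astype(float)  # crude speech flag
        a = alpha_d + (1 - alpha_d) * presence        # presence=1 freezes update
        noise = a * noise + (1 - a) * frame
    return noise
```

The key behavior, captured even by this toy version, is that bursts of speech power raise the smoothed periodogram far above the tracked minimum, so the noise estimate is effectively frozen during speech and keeps following the noise floor otherwise.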
{"title":"Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging","authors":"I. Cohen","doi":"10.1109/TSA.2003.811544","DOIUrl":"https://doi.org/10.1109/TSA.2003.811544","url":null,"abstract":"Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach, for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"3 2 1","pages":"466-475"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78286955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 949
Quantization of LSF parameters using a trellis modeling
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.814411
F. Lahouti, A. Khandani
An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on modeling the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range are explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed, and an efficient algorithm for index generation (finding the index of a path in the trellis) is presented. The sequential vector decorrelation technique is also presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons to the split-VQ, the trellis coded quantization of LSF parameters, and the multi-stage VQ, as well as to the interframe scheme used in the IS-641 EFRC and the GSM AMR codec, are provided. These results demonstrate that the proposed BTQ schemes outperform the above systems.
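The ordering and boundedness constraints mentioned above (0 < ω1 < ... < ωp < π) can be enforced after quantization by a simple projection. A sketch of such a stabilization step; this is a generic repair pass, not the trellis model itself, and it assumes len(lsf) * min_gap is well below π:

```python
import numpy as np

def enforce_lsf_constraints(lsf, min_gap=0.01):
    """Project an LSF vector onto the feasible set: values in (0, pi),
    strictly increasing with at least min_gap separation."""
    out = np.sort(np.clip(np.asarray(lsf, float), min_gap, np.pi - min_gap))
    for i in range(1, len(out)):            # forward pass: push overlaps up
        out[i] = max(out[i], out[i - 1] + min_gap)
    out[-1] = min(out[-1], np.pi - min_gap)
    for i in range(len(out) - 2, -1, -1):   # backward pass: pull down from pi
        out[i] = min(out[i], out[i + 1] - min_gap)
    return out
```

The BTQ scheme avoids the need for such after-the-fact repair by building the ordering and range bounds directly into the trellis, so every decodable path is a valid LSF vector by construction.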
{"title":"Quantization of LSF parameters using a trellis modeling","authors":"F. Lahouti, A. Khandani","doi":"10.1109/TSA.2003.814411","DOIUrl":"https://doi.org/10.1109/TSA.2003.814411","url":null,"abstract":"An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of the line spectral frequencies (LSF) in speech coding applications. The scheme is based on the modeling of the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range is explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed and an efficient algorithm for the index generation (finding the index of a path in the trellis) is presented. Also the sequential vector decorrelation technique is presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons to the split-VQ, the trellis coded quantization of LSF parameters, and the multi-stage VQ, as well as the interframe scheme used in IS-641 EFRC and the GSM AMR codec are provided. These results demonstrate that the proposed BTQ schemes outperform the above systems.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"54 1","pages":"400-412"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86595832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Efficient text-independent speaker verification with structural Gaussian mixture models and neural network
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815822
Bing Xiang, T. Berger
We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network to achieve both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of Gaussian mixture components is scored for each feature vector, reducing the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that the computation can be reduced by a factor of 17 with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
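The computational saving comes from scoring only the Gaussians under the best coarse cluster. A toy two-stage scorer with uniform weights and shared spherical variance (a drastic simplification of the SGMM, with names of our choosing) illustrates that the fast score tracks the full score when clusters are well separated:

```python
import numpy as np

def gmm_loglik(x, means, var=1.0):
    """Full log-likelihood of x under a GMM with uniform weights and a
    shared spherical variance."""
    d2 = np.sum((means - x) ** 2, axis=1)
    ll = -0.5 * d2 / var - 0.5 * len(x) * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(ll) - np.log(len(means))

def structured_loglik(x, centroids, groups, means, var=1.0):
    """Two-stage scoring: pick the closest coarse cluster, then evaluate
    only its member Gaussians (others contribute negligibly)."""
    g = int(np.argmin([np.sum((x - c) ** 2) for c in centroids]))
    member = means[groups[g]]
    d2 = np.sum((member - x) ** 2, axis=1)
    ll = -0.5 * d2 / var - 0.5 * len(x) * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(ll) - np.log(len(means))
```

With M components split into k clusters, the per-frame cost drops from O(M) Gaussian evaluations to roughly O(k + M/k), which is the source of the factor-of-17 speedup reported above.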
{"title":"Efficient text-independent speaker verification with structural Gaussian mixture models and neural network","authors":"Bing Xiang, T. Berger","doi":"10.1109/TSA.2003.815822","DOIUrl":"https://doi.org/10.1109/TSA.2003.815822","url":null,"abstract":"We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for purposes of achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions in different levels of resolution. For each target speaker, a SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During test, only a small subset of Gaussian mixture components are scored for each feature vector in order to reduce the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for final decision. Different configurations are compared in the experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"43 1","pages":"447-456"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80338680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 120
Blind single channel deconvolution using nonstationary signal processing
Pub Date : 2003-08-26 DOI: 10.1109/TSA.2003.815522
J. Hopgood, P. Rayner
Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or the channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The parameters for the channel are estimated by modeling the source as a time-varying AR process and the distortion by an all-pole filter, and by using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, which merely relies on the fact that the channel is actually stationary rather than modeling it as such, the proposed Bayesian method does account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it as a curse is discussed.
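Since the channel is modeled as an all-pole filter 1/A(z), deconvolution with a known (or estimated) A(z) reduces to applying the FIR filter A(z) to the observation; the hard part of the paper is estimating A(z) blindly. A minimal sketch of this forward/inverse pair (the blind Bayesian estimation itself is not shown):

```python
import numpy as np

def allpole_channel(x, a):
    """Pass x through the all-pole channel 1/A(z), where
    a = [1, a1, ..., ap] are the denominator coefficients."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def invert_channel(y, a):
    """Deconvolve by applying the FIR filter A(z) to the observation."""
    return np.convolve(y, a)[:len(y)]
```

This exact invertibility is why estimating the channel poles is equivalent to solving the deconvolution problem: once A(z) is identified, the source is recovered by a short FIR filtering operation.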
{"title":"Blind single channel deconvolution using nonstationary signal processing","authors":"J. Hopgood, P. Rayner","doi":"10.1109/TSA.2003.815522","DOIUrl":"https://doi.org/10.1109/TSA.2003.815522","url":null,"abstract":"Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The important feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The parameters for the channel are estimated by modeling the source as a time-varyng AR process and the distortion by an all-pole filter, and using the Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, where the technique merely relies on the fact that the channel is actually stationary rather than modeling it as so, the proposed Bayesian method does take account for the channel's stationarity in the model and, consequently, is more robust. The properties of this model are investigated, and the advantage of utilizing the nonstationarity of a system rather than considering it as a curse is discussed.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. 
Speech Audio Process.","volume":"469 1","pages":"476-488"},"PeriodicalIF":0.0,"publicationDate":"2003-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77508201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58