Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2266794
C. Joder, S. Essid, G. Richard
This paper addresses the design of feature functions for matching a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first uses a best-fit criterion (minimum divergence), while the second exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that, with several types of models using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments single out the symmetric Kullback-Leibler divergence as the most effective distance. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets to within 100 ms with our most complex system.
Title: Learning Optimal Features for Polyphonic Audio-to-Score Alignment
Journal: IEEE Transactions on Audio Speech and Language Processing
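The core feature function described above can be sketched in a few lines: a template is obtained by a linear mapping of the score's pitch-activation vector, and compared with the observed spectrum via the symmetric Kullback-Leibler divergence. This is only an illustrative sketch, not the paper's learned system; the mapping matrix `M` and the normalization are assumptions.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    # Symmetric Kullback-Leibler divergence between two spectra,
    # normalized here so each behaves like a probability distribution.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    # KL(p||q) + KL(q||p) collapses to a single sum.
    return float(np.sum((p - q) * np.log(p / q)))

def template_from_score(score_vec, M):
    # Template construction as a linear mapping M from the symbolic
    # (pitch-activation) representation to the audio representation.
    # In the paper M is learned; here it is just a given matrix.
    return np.maximum(M @ np.asarray(score_vec, dtype=float), 0.0)
```

Alignment then amounts to evaluating `sym_kl(observed_frame, template_from_score(s, M))` for candidate score positions `s` under the chosen temporal model.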
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2266772
G. Degottex, Y. Stylianou
Voice models often use frequency limits to split the speech spectrum into two or more voiced/unvoiced frequency bands. However, in voice production, the amplitude spectrum of the voiced source decreases smoothly, without any abrupt frequency limit. Accordingly, multiband models struggle to estimate these limits and, as a consequence, artifacts can degrade the perceived quality. Using a linear frequency basis adapted to the non-stationarities of the speech signal, the Fan Chirp Transformation (FChT) has demonstrated harmonicity at frequencies higher than usually observed with the DFT, which motivates full-band modeling. The previously proposed Adaptive Quasi-Harmonic Model (aQHM) offers even more flexibility than the FChT by using a non-linear frequency basis. In the current paper, exploiting the properties of aQHM, we describe a full-band Adaptive Harmonic Model (aHM), along with detailed descriptions of its corresponding algorithms for the estimation of harmonics up to the Nyquist frequency. Formal listening tests show that speech reconstructed using aHM is nearly indistinguishable from the original. Experiments with synthetic signals also show that the proposed aHM globally outperforms previous sinusoidal and harmonic models in terms of precision in estimating the sinusoidal parameters. Such precision is of interest for building higher-level models upon the sinusoidal parameters, such as spectral envelopes for speech synthesis.
Title: Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model
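The defining property of the full-band model, harmonics up to the Nyquist frequency, can be sketched for the stationary case as a plain harmonic sum. This is an assumption-laden simplification: the actual aHM adapts its frequency basis to the signal's non-stationarities, which this sketch omits.

```python
import numpy as np

def harmonic_synthesis(f0, amps, phases, fs, dur):
    # Stationary sketch of a full-band harmonic model: a sum of
    # harmonics of f0 extending up to the Nyquist frequency fs/2.
    t = np.arange(int(round(dur * fs))) / fs
    n_harm = int((fs / 2.0) // f0)            # harmonics at or below Nyquist
    s = np.zeros_like(t)
    for k in range(1, n_harm + 1):
        a = amps[k - 1] if k - 1 < len(amps) else 0.0
        p = phases[k - 1] if k - 1 < len(phases) else 0.0
        s += a * np.cos(2.0 * np.pi * k * f0 * t + p)
    return s
```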
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2269291
Zhenhua Ling, L. Deng, Dong Yu
This paper presents a new spectral modeling method for statistical parametric speech synthesis. In conventional methods, high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM)-based parametric speech synthesis. The method proposed in this paper improves the conventional approach in two ways. First, distributions of low-level, un-transformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of using a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion, with the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM generate spectral envelope parameter sequences better than the conventional Gaussian-HMM, with superior generalization capabilities, and that DBN-HMM and RBM-HMM perform similarly, possibly due to the use of the Gaussian approximation. As a result, the proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
Title: Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis
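The closed-form parameter generation step can be illustrated for a single feature dimension: given per-frame Gaussian means and variances for statics and deltas (standing in for the Gaussian approximation of the RBM/DBN marginals), the most probable static trajectory solves a linear system. A minimal sketch, assuming a standard three-point delta window (not taken from the paper):

```python
import numpy as np

def mlpg(mu, var, delta=(-0.5, 0.0, 0.5)):
    # Maximum output probability parameter generation for one feature
    # dimension. mu and var stack per-frame Gaussian means/variances of
    # the statics (first T entries) and deltas (last T entries); the
    # deltas are constrained to be a linear function W of the statics,
    # giving the closed-form solution c = (W'PW)^{-1} W'P mu.
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                      # static features
    for t in range(T):                        # delta features
        for j, w in enumerate(delta):
            tau = t + j - 1
            if 0 <= tau < T and w != 0.0:
                W[T + t, tau] = w
    P = np.diag(1.0 / var)                    # inverse (diagonal) covariance
    A = W.T @ P @ W
    return np.linalg.solve(A, W.T @ P @ mu)
```

The delta constraint is what smooths the generated trajectory; with equal variances the solution is pulled slightly away from the static means near the boundaries.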
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2272513
Takuya Yoshioka, T. Nakatani
This paper proposes an approach, called noise model transfer (NMT), for estimating the rapidly changing parameter values of a feature-domain noise model, which can be used to enhance feature vectors corrupted by highly nonstationary noise. Unlike conventional methods, the proposed approach can exploit both observed feature vectors, representing spectral envelopes, and other signal properties that are usually discarded during feature extraction but that are useful for separating nonstationary noise from speech. Specifically, we assume the availability of a noise power spectrum estimator that can capture rapid changes in noise characteristics by leveraging such signal properties. NMT determines the transformation from the estimated noise power spectra into the feature-domain noise model parameter values that is optimal in the maximum likelihood sense. NMT is successfully applied to meeting speech recognition, where the main noise sources are competing talkers, and to reverberant speech recognition, where the late reverberation is regarded as highly nonstationary additive noise.
Title: Noise Model Transfer: Novel Approach to Robustness Against Nonstationary Noise
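The kind of spectrum-to-feature transfer that NMT learns can be illustrated with a fixed (not learned) mapping: project an estimated noise power spectrum through a mel filterbank and take logs to obtain feature-domain noise parameters. The filterbank construction below is the standard textbook one, an assumption rather than the paper's optimized transformation:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, fs):
    # Standard triangular mel filterbank (a textbook construction).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):         # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):        # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def noise_to_feature_domain(noise_psd, fb):
    # Fixed transfer of an estimated noise power spectrum into the
    # (log-mel) feature domain; NMT instead learns the optimal such
    # transformation under a maximum likelihood criterion.
    return np.log(fb @ noise_psd + 1e-10)
```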
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2263803
Yow-Bang Wang, Shang-Wen Li, Lin-Shan Lee
Gabor features have been proposed for extracting spectro-temporal modulation information from speech signals, and have been shown to yield large improvements in recognition accuracy. We use a flexible Tandem system framework that integrates multi-stream information, including Gabor, MFCC, and pitch features, in various ways, by modeling either or both of the tone and phoneme variations in Mandarin speech recognition. We use either phonemes or tonal phonemes (tonemes) as the target classes of MLP posterior estimation and/or as the acoustic units of HMM recognition. The experiments yield a comprehensive analysis of the contribution each feature set makes to recognition accuracy. We discuss their complementarity in tone, phoneme, and toneme classification. We show that Gabor features are better for recognition of vowels and unvoiced consonants, while MFCCs are better for voiced consonants. Moreover, Gabor features are capable of capturing the changes in the signal across time and frequency bands caused by Mandarin tone patterns, while pitch features offer additional tonal information. This explains why the integration of Gabor, MFCC, and pitch features offers such significant improvements.
Title: An Experimental Analysis on Integrating Multi-Stream Spectro-Temporal, Cepstral and Pitch Information for Mandarin Speech Recognition
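A spectro-temporal Gabor feature of the kind discussed above can be sketched as a complex 2-D filter applied to a log-spectrogram. The filter size, Hann envelope, and modulation frequencies below are illustrative assumptions, not the paper's filterbank:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_t, omega_f, size=(9, 9)):
    # 2-D spectro-temporal Gabor filter: a complex sinusoid with the
    # given temporal and spectral modulation frequencies (rad/sample)
    # under a Hann envelope.
    n_t, n_f = size
    t = np.arange(n_t) - n_t // 2
    f = np.arange(n_f) - n_f // 2
    envelope = np.outer(np.hanning(n_t), np.hanning(n_f))
    carrier = np.exp(1j * (omega_t * t[:, None] + omega_f * f[None, :]))
    g = envelope * carrier
    return g - g.mean()        # remove DC so constant offsets are ignored

def gabor_features(log_spec, filt):
    # Filter a log spectrogram (time x frequency) and keep the magnitude
    # of the complex response as the feature map.
    return np.abs(convolve2d(log_spec, filt, mode='same'))
```

A non-zero `omega_t` makes the filter respond to temporal modulations such as pitch movement across frames, which is consistent with the tonal sensitivity reported above.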
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2265087
Chao Zhang, Yi Liu, Yunqing Xia, Xuan Wang, Chin-Hui Lee
In this paper, we propose a discriminative dynamic Gaussian mixture selection (DGMS) strategy to generate reliable accent-specific units (ASUs) for multi-accent speech recognition. Time-aligned phone recognition is used to generate the ASUs, which model accent variations explicitly and accurately. DGMS reconstructs and adjusts a pre-trained set of hidden Markov model (HMM) state densities to build dynamic observation densities for each input speech frame. A discriminative minimum classification error criterion is adopted to optimize the sizes of the HMM state observation densities with a genetic algorithm (GA). To the authors' knowledge, this is the first time such discriminative training has been applied to discrete variables. We found that the proposed framework is able to cover more multi-accent changes, and thus reduce performance loss in pruned beam search, without increasing the model size of the original acoustic model set. Evaluation on three typical Chinese accents, Chuan, Yue and Wu, shows that our approach outperforms traditional acoustic model reconstruction techniques with syllable error rate reductions of 8.0%, 5.5% and 5.0%, respectively, while maintaining good performance on standard Putonghua speech.
Title: Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2270370
S. Siniscalchi, Jinyu Li, Chin-Hui Lee
Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of an automatic speech recognition (ASR) system. This work addresses the problem of increased degradation in performance when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has proven to be a difficult task, and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can therefore have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR given the great success reported on many speech tasks by deep neural networks, i.e., ANNs with many hidden layers that adopt a pre-training technique. Current adaptation techniques for ANNs, based on injecting an adaptable linear transformation network connected to either the input or the output layer, are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and remain robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights; the adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.
Title: Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems
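The central idea, an activation function expressed as a weighted sum of Hermite polynomials whose coefficients are the only speaker-adapted parameters, can be sketched directly. The least-squares fit below is a linear stand-in for the gradient-based adaptation a real hybrid HMM/ANN system would perform:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_activation(x, coeffs):
    # Hidden activation as a weighted sum of (probabilists') Hermite
    # polynomials He_k; speaker adaptation updates only `coeffs`,
    # leaving the network weights untouched.
    return hermeval(x, coeffs)

def adapt_coeffs(x, target, order=4):
    # Least-squares fit of the activation coefficients to adaptation
    # data (an illustrative stand-in, not the paper's procedure).
    basis = [hermeval(x, np.eye(order + 1)[k]) for k in range(order + 1)]
    V = np.stack(basis, axis=1)
    coeffs, *_ = np.linalg.lstsq(V, target, rcond=None)
    return coeffs
```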
Pub Date: 2013-10-01 | DOI: 10.1109/TASL.2013.2270406
N. Gaubitch, M. Brookes, P. Naylor
We present an algorithm for blind estimation of the magnitude response of an acoustic channel from single-microphone observations of a speech signal. The algorithm employs channel-robust RASTA-filtered Mel-frequency cepstral coefficients as features to train a Gaussian mixture model based classifier, and average clean speech spectra are associated with each mixture component; these are then used to blindly estimate the acoustic channel magnitude response from speech that has undergone spectral modification due to the channel. Experimental results using a variety of simulated and measured acoustic channels and additive babble noise, car noise and white Gaussian noise are presented. The results demonstrate that the proposed method is able to estimate a variety of channel magnitude responses to within an Itakura distance of dI ≤ 0.5 for SNR ≥ 10 dB.
Title: Blind Channel Magnitude Response Estimation in Speech Using Spectrum Classification
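The estimation step described above reduces to a simple computation once the GMM posteriors and the per-mixture average clean spectra are available: in the log-spectral domain, the channel is the average gap between the observed spectra and the posterior-weighted expected clean spectra. A minimal sketch (shapes and the log-domain formulation are assumptions consistent with, but not taken from, the abstract):

```python
import numpy as np

def estimate_channel(log_specs, post, clean_means):
    # Blind channel magnitude response estimate in the log-spectral
    # domain. The expected clean log-spectrum of each frame is a
    # posterior-weighted combination of per-mixture average clean
    # spectra; the channel is the average observed-minus-expected gap.
    #   log_specs:   (frames, bins) observed log spectra
    #   post:        (frames, mixtures) GMM posteriors from MFCC features
    #   clean_means: (mixtures, bins) average clean log spectra
    expected_clean = post @ clean_means
    return np.mean(log_specs - expected_clean, axis=0)
```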
Pub Date: 2013-09-01 | DOI: 10.1109/TASL.2013.2261814
Muhammad Salman Khan, S. M. Naqvi, Ata ur-Rehman, Wenwu Wang, J. Chambers
Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, the interaural phase difference and the interaural level difference, as well as the mixing vectors are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm can produce better time-frequency masks thereby giving improved source estimates. We provide experimental results to test the proposed algorithm in different scenarios and provide comparisons with both other audio-only and audio-visual algorithms and achieve improved performance both on synthetic and real data. We also include dereverberation based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
{"title":"Video-Aided Model-Based Source Separation in Real Reverberant Rooms","authors":"Muhammad Salman Khan, S. M. Naqvi, Ata ur-Rehman, Wenwu Wang, J. Chambers","doi":"10.1109/TASL.2013.2261814","DOIUrl":"https://doi.org/10.1109/TASL.2013.2261814","url":null,"abstract":"Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, the interaural phase difference and the interaural level difference, as well as the mixing vectors are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm can produce better time-frequency masks thereby giving improved source estimates. We provide experimental results to test the proposed algorithm in different scenarios and provide comparisons with both other audio-only and audio-visual algorithms and achieve improved performance both on synthetic and real data. We also include dereverberation based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. 
This advantage makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2261814","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62889329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
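The interaural cues and time-frequency masking described in the abstract above can be illustrated with a minimal sketch: a generic IPD/ILD computation from two-channel STFTs and a simple IPD-based hard mask. This is not the authors' probabilistic EM model (which also uses mixing vectors and video-derived directions); all function names and the von-Mises-style scoring are illustrative assumptions.

```python
import numpy as np

def interaural_cues(stft_left, stft_right, eps=1e-12):
    """Interaural phase difference (IPD, radians) and level difference
    (ILD, dB) at each time-frequency point of a two-channel STFT."""
    ratio = stft_left / (stft_right + eps)
    ipd = np.angle(ratio)
    ild = 20.0 * np.log10(np.abs(ratio) + eps)
    return ipd, ild

def masks_from_ipd(ipd, expected_ipds, kappa=5.0):
    """Assign each T-F point to the source whose expected IPD fits best,
    scoring with a von-Mises-like circular likelihood exp(kappa*cos(dphi)).
    Returns one binary mask per source."""
    # scores has shape (n_sources, n_freq, n_frames)
    scores = np.stack([np.exp(kappa * np.cos(ipd - mu)) for mu in expected_ipds])
    assignment = np.argmax(scores, axis=0)
    return [(assignment == k).astype(float) for k in range(len(expected_ipds))]
```

Multiplying each mask with one channel's STFT and inverting the transform would give the per-source estimates; the paper's soft, EM-refined masks replace the hard `argmax` with posterior probabilities.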
Pub Date : 2013-09-01 DOI: 10.1109/TASL.2013.2261813
R. Scharrer, M. Vorländer
The quality and performance of many multi-channel signal processing strategies in microphone arrays, as well as in mobile devices for the enhancement of speech intelligibility and audio quality, depend to a large extent on the acoustic sound field to which they are exposed. If the assumptions about the sound field are not met, the performance decreases significantly and may even yield worse results for the user than the unprocessed signal. Current hearing aids, for instance, provide the user with different programs to adapt the signal processing to the acoustic situation. Signal classification, however, describes the signal content and not the type of sound field. Therefore, a further classification of the sound field, in addition to the signal classification, would increase the possibilities for an optimal adaptation of the automatic program selection and of the signal processing methods in mobile devices. To this end, a sound field classification method is proposed that is based on the complex coherences between the input signals of distributed acoustic sensors. In addition to the general approach, an exemplary setup of a hearing aid equipped with two microphone sensors is discussed. As only coherences are used, the method classifies the sound field regardless of the signal carried by it. This approach complements and extends the signal classification currently used in common mobile devices. The method was successfully verified with simulated audio input signals and with real-life examples.
{"title":"Sound Field Classification in Small Microphone Arrays Using Spatial Coherences","authors":"R. Scharrer, M. Vorländer","doi":"10.1109/TASL.2013.2261813","DOIUrl":"https://doi.org/10.1109/TASL.2013.2261813","url":null,"abstract":"The quality and performance of many multi-channel signal processing strategies in microphone arrays as well as mobile devices for the enhancement of speech intelligibility and audio quality depends to a large extent on the acoustic sound field that they are exposed to. As long as the assumption on the sound field is not met, the performance decreases significantly and may even yield worse results for the user than an unprocessed signal. Current hearing aids provide the user for instance with different programs to adapt the signal processing to the acoustic situation. Signal classification describes the signal content and not the type of sound field. Therefore, a further classification of the sound field, in addition to the signal classification, would increase the possibilities for an optimal adaption of the automatic program selection and the signal processing methods in mobile devices. To this end a sound field classification method is proposed that is based on the complex coherences between the input signals of distributed acoustic sensors. In addition to the general approach an exemplary setup of a hearing aid equipped with two microphone sensors is discussed. As only coherences are used, the method classifies the sound field regardless of the signal carried by it. This approach complements and extends the current signal classification approach used in common mobile devices. 
The method was successfully verified with simulated audio input signals and with real-life examples.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2261813","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62889264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
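The complex coherence that the abstract above builds its classification on can be sketched with a standard Welch-style estimate between two sensor signals. This is a generic textbook estimator, not the authors' classifier; the function name and framing parameters are illustrative assumptions.

```python
import numpy as np

def complex_coherence(x, y, nfft=256, hop=128):
    """Welch-averaged complex coherence between two microphone signals:
    gamma(f) = <X Y*> / sqrt(<|X|^2> <|Y|^2>), with Hann-windowed frames."""
    win = np.hanning(nfft)
    starts = range(0, len(x) - nfft + 1, hop)
    X = np.array([np.fft.rfft(win * x[i:i + nfft]) for i in starts])
    Y = np.array([np.fft.rfft(win * y[i:i + nfft]) for i in starts])
    sxy = np.mean(X * np.conj(Y), axis=0)   # cross power spectrum
    sxx = np.mean(np.abs(X) ** 2, axis=0)   # auto power spectra
    syy = np.mean(np.abs(Y) ** 2, axis=0)
    return sxy / np.sqrt(sxx * syy + 1e-12)
```

A magnitude near one across frequency indicates a coherent (direct) sound field, while a diffuse field yields a low, frequency-dependent magnitude; the proposed method classifies the field type from such coherence patterns, independently of the signal content.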