
Latest Publications in IEEE Transactions on Audio Speech and Language Processing

Feature Enhancement With Joint Use of Consecutive Corrupted and Noise Feature Vectors With Discriminative Region Weighting
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2270407
Masayuki Suzuki, Takuya Yoshioka, Shinji Watanabe, N. Minematsu, K. Hirose
This paper proposes a feature enhancement method that can achieve high speech recognition performance in a variety of noise environments at feasible computational cost. Like the well-known Stereo-based Piecewise Linear Compensation for Environments (SPLICE) algorithm, the proposed method learns a piecewise linear transformation that maps corrupted feature vectors to the corresponding clean features, which enables efficient operation. To make the feature enhancement process adaptive to changes in noise, the piecewise linear transformation is performed using a subspace of the joint space of corrupted and noise feature vectors, where the subspace is chosen such that the classes (i.e., Gaussian mixture components) of the underlying clean feature vectors can be best predicted. In addition, we propose utilizing temporally adjacent frames of corrupted and noise features in order to leverage the dynamic characteristics of feature vectors. To prevent overfitting caused by the high dimensionality of the extended feature vectors covering the neighboring frames, we introduce a regularized weighted minimum mean square error criterion. The proposed method achieved relative improvements of 34.2% and 22.2% over SPLICE under the clean and multi-style conditions, respectively, on the Aurora 2 task.
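The following is a minimal sketch of the bias-only, SPLICE-style enhancement step this abstract builds on: a diagonal-covariance GMM over corrupted features selects regions, and a per-component bias (learned offline from stereo data) corrects each frame. All names, shapes, and parameters are illustrative, and the paper's discriminative region weighting, noise-feature subspace, and regularized criterion are not reproduced here.

```python
import numpy as np

def gmm_posteriors(Y, means, variances, weights):
    """Posterior p(k|y) of each GMM component for each frame (diagonal covariances)."""
    diff = Y[:, None, :] - means[None, :, :]                          # (T, K, D)
    log_lik = -0.5 * np.sum(diff**2 / variances
                            + np.log(2 * np.pi * variances), axis=2)  # (T, K)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)                   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def splice_enhance(Y, means, variances, weights, biases):
    """Bias-only SPLICE: x_hat = y + sum_k p(k|y) * b_k."""
    post = gmm_posteriors(Y, means, variances, weights)               # (T, K)
    return Y + post @ biases                                          # (T, D)

# Toy usage with random "trained" parameters: K regions, D-dimensional features.
rng = np.random.default_rng(0)
K, D, T = 4, 13, 100
Y = rng.normal(size=(T, D))                    # corrupted MFCC-like frames
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
weights = np.full(K, 1.0 / K)
biases = rng.normal(scale=0.1, size=(K, D))    # learned from stereo data in practice
print(splice_enhance(Y, means, variances, weights, biases).shape)     # (100, 13)
```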
Citations: 7
Learning Optimal Features for Polyphonic Audio-to-Score Alignment
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2266794
C. Joder, S. Essid, G. Richard
This paper addresses the design of feature functions for matching a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first uses a best-fit criterion (minimum divergence), while the second exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments favor the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.
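As a sketch of the kind of feature function described here: a template is obtained by a linear mapping from score space to spectral space and compared with an audio observation using the symmetric Kullback-Leibler divergence that the experiments favor. The mapping `M` and all dimensions below are illustrative stand-ins for the learned transformation.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two normalized spectra."""
    p = p / (p.sum() + eps) + eps
    q = q / (q.sum() + eps) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def score_to_template(score_vec, M):
    """Linear mapping from the symbolic (score) space to the audio (spectral) space."""
    return np.maximum(M @ score_vec, 0.0)      # keep the spectral template non-negative

# Toy usage: 88 piano keys mapped into a 257-bin magnitude spectrum.
rng = np.random.default_rng(1)
M = np.abs(rng.normal(size=(257, 88)))         # stands in for the learned mapping
score_vec = np.zeros(88)
score_vec[[39, 43, 46]] = 1.0                  # three simultaneous notes in the score
template = score_to_template(score_vec, M)
observation = np.abs(rng.normal(size=257))     # stands in for one audio frame
print(symmetric_kl(observation, template))     # smaller = better match
```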
Citations: 30
Multiobjective Time Series Matching for Audio Classification and Retrieval
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2265086
P. Esling, C. Agón
Seeking sound samples in a massive database can be a tedious and time-consuming task. Even when metadata are available, query results may remain far from the timbre expected by users. This problem stems from the nature of query specification, which does not account for the underlying complexity of audio data. The Query By Example (QBE) paradigm tries to tackle this shortcoming by finding audio clips similar to a given sound example. However, it requires users to have a well-formed soundfile of what they seek, which is not always a valid assumption. Furthermore, most audio-retrieval systems rely on a single measure of similarity, which is unlikely to convey the perceptual similarity of audio signals. In this paper, we address an innovative way of querying generic audio databases by simultaneously optimizing the temporal evolution of multiple spectral properties. We show how this problem can be cast into a new approach merging multiobjective optimization and time series matching, called MultiObjective Time Series (MOTS) matching. We formally state this problem and report an efficient implementation. This approach introduces a multidimensional assessment of similarity in audio matching. It copes with the multidimensional nature of timbre perception and yields a set of efficient propositions rather than a single best solution. To demonstrate the performance of our approach, we show its efficiency in audio classification tasks. By introducing a selection criterion based on the hypervolume dominated by a class, we show that our approach outperforms state-of-the-art methods in audio classification even with a small number of features. We demonstrate its robustness to several classes of audio distortions. Finally, we introduce two innovative applications of our method for sound querying.
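A small sketch of the multiobjective selection idea, under simplifying assumptions: one distance per spectral descriptor (a plain RMS distance over fixed-length series stands in for the paper's time series matching), and the query returns the Pareto-efficient set of database items rather than a single best match.

```python
import numpy as np

def pareto_front(costs):
    """Indices of non-dominated rows of `costs` (minimization on every objective)."""
    keep = []
    for i in range(costs.shape[0]):
        dominated = np.any(np.all(costs <= costs[i], axis=1)
                           & np.any(costs < costs[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

def per_descriptor_distances(query, item):
    """One RMS distance per spectral descriptor; query and item are (T, F) arrays."""
    return np.sqrt(np.mean((query - item) ** 2, axis=0))   # (F,)

# Toy usage: 20 database items, 50 frames, 3 spectral descriptors (objectives).
rng = np.random.default_rng(2)
T, F, N = 50, 3, 20
query = rng.normal(size=(T, F))
database = rng.normal(size=(N, T, F))
costs = np.stack([per_descriptor_distances(query, item) for item in database])
print("Pareto-efficient matches:", pareto_front(costs))
```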
Citations: 26
Accurate Estimation of Low Fundamental Frequencies From Real-Valued Measurements
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2265085
M. G. Christensen
In this paper, the difficult problem of estimating low fundamental frequencies from real-valued measurements is addressed. The methods commonly employed do not take the phenomena encountered in this scenario into account and thus fail to deliver accurate estimates. The reason is that they employ asymptotic approximations that are violated when the harmonics are not well-separated in frequency, which happens when the observed signal is real-valued and the fundamental frequency is low. We therefore analyze the problem and present exact fundamental frequency estimators aimed at solving it. These estimators are based on the principles of nonlinear least-squares, harmonic fitting, optimal filtering, subspace orthogonality, and shift-invariance, and they all reduce to already published methods for a large number of observations. In experiments, the methods are compared, and the increased accuracy obtained by avoiding asymptotic approximations is demonstrated.
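A sketch of an exact nonlinear least-squares estimator in the spirit of this paper, with illustrative grid resolution and frame length: for every candidate fundamental, the harmonic least-squares problem is solved explicitly instead of assuming the harmonic basis is orthogonal, which is precisely the assumption that breaks down for real-valued, low-pitched signals.

```python
import numpy as np

def nls_f0(x, fs, f0_grid, n_harmonics):
    """Exact NLS F0 estimate for a real-valued frame: no asymptotic orthogonality
    between harmonics is assumed; the LS fit is solved explicitly per candidate."""
    n = np.arange(len(x))
    best_f0, best_energy = None, -np.inf
    for f0 in f0_grid:
        omega = 2 * np.pi * f0 / fs
        Z = np.concatenate(
            [np.stack([np.cos(l * omega * n), np.sin(l * omega * n)], axis=1)
             for l in range(1, n_harmonics + 1)], axis=1)    # (N, 2L) harmonic basis
        coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
        energy = x @ (Z @ coef)          # energy captured by the exact LS fit
        if energy > best_energy:
            best_f0, best_energy = f0, energy
    return best_f0

# Toy usage: an 80 ms frame of a 60 Hz harmonic signal, where neighboring
# harmonics are poorly separated and approximate methods tend to fail.
fs, f0_true = 8000, 60.0
t = np.arange(0, 0.08, 1 / fs)
x = sum(np.cos(2 * np.pi * l * f0_true * t) / l for l in range(1, 6))
print(nls_f0(x, fs, np.arange(40.0, 100.0, 0.5), n_harmonics=5))   # ~60.0
```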
Citations: 31
Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2266773
Bilei Zhu, Wei Li, Ruijiang Li, X. Xue
Separating singing voice from music accompaniment can be of interest for many applications such as melody extraction, singer identification, lyrics alignment and recognition, and content-based music retrieval. In this paper, a novel algorithm for singing voice separation in monaural mixtures is proposed. The algorithm consists of two stages, where non-negative matrix factorization (NMF) is applied to decompose the mixture spectrograms with long and short windows, respectively. A spectral discontinuity thresholding method is devised for the long-window NMF to select NMF components originating from pitched instrumental sounds, and a temporal discontinuity thresholding method is designed for the short-window NMF to pick out NMF components that come from percussive sounds. By eliminating the selected components, most pitched and percussive elements of the music accompaniment are filtered out of the input sound mixture, with little effect on the singing voice. Extensive testing on the MIR-1K public dataset of 1000 short audio clips and the Beach-Boys dataset of 14 full-track real-world songs showed that the proposed algorithm is both effective and efficient.
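A heavily simplified, single-stage stand-in for the idea, assuming scikit-learn's NMF: decompose a magnitude spectrogram, treat components whose activations vary little over time as accompaniment, and keep the rest through a Wiener-style mask. The crude activation-variation test below only gestures at the paper's spectral and temporal discontinuity thresholds, and the two-window, two-stage design is not reproduced.

```python
import numpy as np
from sklearn.decomposition import NMF

def separate_voice(V, n_components=16, variation_thresh=0.5):
    """Toy NMF separation of a magnitude spectrogram V (freq bins x frames)."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = model.fit_transform(V)                   # (F, K) spectral bases
    H = model.components_                        # (K, T) temporal activations
    # Normalized frame-to-frame variation of each component's activation.
    variation = np.std(np.diff(H, axis=1), axis=1) / (np.mean(H, axis=1) + 1e-10)
    voice = variation >= variation_thresh        # strongly varying -> voice-like
    V_voice = W[:, voice] @ H[voice]
    V_accomp = W[:, ~voice] @ H[~voice]
    mask = V_voice / (V_voice + V_accomp + 1e-10)   # Wiener-style soft mask
    return mask * V

rng = np.random.default_rng(3)
V = np.abs(rng.normal(size=(513, 200)))          # stand-in magnitude spectrogram
print(separate_voice(V).shape)                   # (513, 200)
```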
Citations: 58
Real-Time Multiple Sound Source Localization and Counting Using a Circular Microphone Array
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2272524
Despoina Pavlidi, Anthony Griffin, M. Puigt, A. Mouchtaris
In this work, a multiple sound source localization and counting method is presented that imposes relaxed sparsity constraints on the source signals. A uniform circular microphone array is used to overcome the ambiguities of linear arrays; however, the underlying concepts (sparse component analysis and matching pursuit-based operation on the histogram of estimates) are applicable to any microphone array topology. Our method is based on detecting time-frequency (TF) zones where one source is dominant over the others. Using appropriately selected TF components in these “single-source” zones, the proposed method jointly estimates the number of active sources and their corresponding directions of arrival (DOAs) by applying a matching pursuit-based approach to the histogram of DOA estimates. The method is shown to have excellent performance for DOA estimation and source counting, and to be highly suitable for real-time applications due to its low complexity. Through simulations (in various signal-to-noise ratio conditions and reverberant environments) and real environment experiments, we show that our method outperforms other state-of-the-art DOA and source counting methods in terms of accuracy, while being significantly more efficient in terms of computational complexity.
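A sketch of the matching-pursuit-style operation on the histogram of DOA estimates, assuming the per-zone DOA estimates are already available; the Gaussian atom shape, bin width, and stopping ratio below are illustrative choices, not the paper's values.

```python
import numpy as np

def count_and_localize(doas_deg, n_bins=72, kernel_width=3, stop_ratio=0.2):
    """Iteratively pick histogram peaks and subtract an atom around each one;
    the atoms found give the DOAs, and their number the source-count estimate."""
    hist, edges = np.histogram(doas_deg, bins=n_bins, range=(0.0, 360.0))
    hist = hist.astype(float)
    offsets = np.arange(-3 * kernel_width, 3 * kernel_width + 1)
    kernel = np.exp(-0.5 * (offsets / kernel_width) ** 2)   # Gaussian atom
    initial_max = hist.max()
    sources = []
    while hist.max() > stop_ratio * initial_max:
        peak = int(np.argmax(hist))
        sources.append(0.5 * (edges[peak] + edges[peak + 1]))
        idx = (peak + offsets) % n_bins          # wrap around the circle
        hist[idx] = np.maximum(hist[idx] - hist[peak] * kernel, 0.0)
    return sources

# Toy usage: three simulated sources at roughly 60, 200, and 310 degrees.
rng = np.random.default_rng(4)
doas = np.concatenate([rng.normal(60, 4, 300), rng.normal(200, 4, 250),
                       rng.normal(310, 4, 200)]) % 360
print(count_and_localize(doas))
```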
Citations: 208
Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2266772
G. Degottex, Y. Stylianou
Voice models often use frequency limits to split the speech spectrum into two or more voiced/unvoiced frequency bands. In voice production, however, the amplitude spectrum of the voiced source decreases smoothly, without any abrupt frequency limit. Accordingly, multiband models struggle to estimate these limits and, as a consequence, artifacts can degrade the perceived quality. Using a linear frequency basis adapted to the non-stationarities of the speech signal, the Fan Chirp Transformation (FChT) has demonstrated harmonicity at frequencies higher than usually observed from the DFT, which motivates full-band modeling. The previously proposed Adaptive Quasi-Harmonic Model (aQHM) offers even more flexibility than the FChT by using a non-linear frequency basis. In the current paper, exploiting the properties of aQHM, we describe a full-band Adaptive Harmonic Model (aHM), along with detailed descriptions of its corresponding algorithms for the estimation of harmonics up to the Nyquist frequency. Formal listening tests show that speech reconstructed using aHM is nearly indistinguishable from the original speech. Experiments with synthetic signals also show that the proposed aHM globally outperforms previous sinusoidal and harmonic models in terms of the precision with which the sinusoidal parameters are estimated. Such precision is of interest for building higher-level models upon the sinusoidal parameters, like spectral envelopes for speech synthesis.
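A stationary-pitch simplification of harmonic analysis/synthesis up to the Nyquist frequency, solved by least squares; the defining feature of aHM, the frequency basis adapting to the signal's non-stationarities, is deliberately omitted, so this only illustrates the full-band fitting step.

```python
import numpy as np

def harmonic_analysis_synthesis(x, fs, f0):
    """LS estimation of harmonic amplitudes/phases up to Nyquist, then resynthesis."""
    n_harm = int(np.floor((fs / 2) / f0))        # all harmonics below Nyquist
    n = np.arange(len(x))
    omega = 2 * np.pi * f0 / fs
    Z = np.concatenate(
        [np.stack([np.cos(l * omega * n), np.sin(l * omega * n)], axis=1)
         for l in range(1, n_harm + 1)], axis=1)
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ coef                             # resynthesized frame
    snr = 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat) ** 2))
    return x_hat, snr

# Toy usage: a 30 ms frame of a 140 Hz harmonic signal sampled at 16 kHz.
fs, f0 = 16000, 140.0
t = np.arange(0, 0.03, 1 / fs)
x = sum(np.cos(2 * np.pi * l * f0 * t + 0.3 * l) / l for l in range(1, 20))
x_hat, snr = harmonic_analysis_synthesis(x, fs, f0)
print(round(snr, 1), "dB reconstruction SNR")
```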
Citations: 56
Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2264673
Stephen Shum, N. Dehak, Réda Dehak, James R. Glass
In speaker diarization, standard approaches typically perform speaker clustering on some initial segmentation before refining the segment boundaries in a re-segmentation step to obtain a final diarization hypothesis. In this paper, we integrate an improved clustering method with an existing re-segmentation algorithm and, in iterative fashion, optimize both speaker cluster assignments and segmentation boundaries jointly. For clustering, we extend our previous research using factor analysis for speaker modeling. In continuing to take advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features (i.e., i-vectors), we develop a probabilistic approach to speaker clustering by applying a Bayesian Gaussian Mixture Model (GMM) to principal component analysis (PCA)-processed i-vectors. We then utilize information at different temporal resolutions to arrive at an iterative optimization scheme that, in alternating between clustering and re-segmentation steps, demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results that are comparable to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus. We further compare our system with a Bayesian nonparametric approach to diarization and attempt to reconcile their differences in both methodology and performance.
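A sketch of the clustering stage alone, assuming scikit-learn and synthetic "i-vectors": PCA projection followed by a Bayesian GMM whose sparse Dirichlet prior lets surplus components vanish, which is what makes the speaker count a by-product of clustering. Dimensions, priors, and data are illustrative, and the paper's iterative re-segmentation loop is not included.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

def cluster_ivectors(ivectors, max_speakers=10, pca_dim=20):
    """PCA-project segment i-vectors, then cluster with a Bayesian GMM."""
    reduced = PCA(n_components=pca_dim).fit_transform(ivectors)
    gmm = BayesianGaussianMixture(
        n_components=max_speakers,
        weight_concentration_prior=1e-2,   # sparse prior prunes unused clusters
        max_iter=500, random_state=0)
    return gmm.fit_predict(reduced), gmm.weights_

# Toy usage: 3 "speakers", 20 segments each, 100-dimensional i-vectors.
rng = np.random.default_rng(5)
centers = rng.normal(scale=3.0, size=(3, 100))
ivecs = np.vstack([c + rng.normal(size=(20, 100)) for c in centers])
labels, weights = cluster_ivectors(ivecs)
print(np.unique(labels).size, "clusters in use")
```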
Citations: 168
Automatic Ontology Generation for Musical Instruments Based on Audio Analysis
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2263801
Ş. Kolozali, M. Barthet, György Fazekas, M. Sandler
In this paper we present a novel hybrid system that involves a formal method of automatic ontology generation for web-based audio signal processing applications. An ontology is seen as a knowledge management structure that represents domain knowledge in a machine-interpretable format. It describes concepts and relationships within a particular domain, in our case, the domain of musical instruments. However, the different tasks of ontology engineering, including manual annotation, hierarchical structuring, and organization of data, can be laborious and challenging. For these reasons, we investigate how the process of creating ontologies can be made less dependent on human supervision by exploring concept analysis techniques in a Semantic Web environment. In this study, various musical instruments, from wind to string families, are classified using timbre features extracted from audio. To obtain models of the analysed instrument recordings, we use K-means clustering to determine an optimised codebook of Line Spectral Frequencies (LSFs) or Mel-frequency Cepstral Coefficients (MFCCs). Two classification techniques, based on a Multi-Layer Perceptron (MLP) neural network and Support Vector Machines (SVM), were tested. Then, Formal Concept Analysis (FCA) is used to automatically build the hierarchical structure of musical instrument ontologies. Finally, the generated ontologies are expressed using the Ontology Web Language (OWL). System performance was evaluated under natural recording conditions using databases of isolated notes and melodic phrases. Analyses of Variance (ANOVA) were conducted with the feature and classifier attributes as independent variables and the musical instrument recognition F-measure as the dependent variable. Based on these statistical analyses, a detailed comparison between musical instrument recognition models is made to investigate their effects on the automatic ontology generation system. The proposed system is general and also applicable to other research fields related to ontologies and the Semantic Web.
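A sketch of the codebook-plus-classifier front end, assuming scikit-learn and synthetic frames: a K-means codebook is fitted over all frames, each recording is encoded as a normalized codeword histogram, and an SVM classifies the recordings. The FCA-based ontology construction that follows in the paper is not reproduced, and the bag-of-codewords encoding here is only one plausible reading of the codebook modeling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def codeword_histogram(frames, kmeans):
    """Encode one recording (frames x dims) as a normalized codeword histogram."""
    words = kmeans.predict(frames)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

# Toy usage: two "instruments" with shifted MFCC-like frame distributions.
rng = np.random.default_rng(6)
recordings, labels = [], []
for label, shift in [(0, 0.0), (1, 2.0)]:
    for _ in range(15):
        recordings.append(rng.normal(loc=shift, size=(200, 13)))
        labels.append(label)
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(np.vstack(recordings))
X = np.array([codeword_histogram(r, kmeans) for r in recordings])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))    # training accuracy on the toy data
```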
Citations: 23
Non-Negative Temporal Decomposition of Speech Parameters by Multiplicative Update Rules
Pub Date: 2013-10-01 DOI: 10.1109/TASL.2013.2266774
S. Hiroya
This paper presents a non-negative temporal decomposition method for line spectral pairs and articulatory parameters based on multiplicative update rules. These parameters are decomposed into a set of temporally overlapped unimodal event functions restricted to the range [0,1] and corresponding event vectors. When line spectral pairs are used, the event vectors preserve their ordering property. With the proposed method, the RMS error between the measured and reconstructed articulatory parameters is 0.21 mm, and the spectral distance between the measured and reconstructed line spectral pair parameters is 2.0 dB. Both the RMS error and the spectral distance are smaller than those of conventional methods. This technique will be useful for many applications in speech coding and speech modification.
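A sketch of the multiplicative updates with the [0,1] range constraint, assuming a non-negative parameter matrix: the standard NMF update rules are applied to the event vectors and event functions, and the event functions are clipped to [0,1] after each step. The paper's unimodality and temporal-overlap constraints on the event functions are omitted here.

```python
import numpy as np

def nn_temporal_decomposition(Y, n_events, n_iter=300, eps=1e-10):
    """Decompose non-negative Y (D x T) as V @ E, with event functions E in [0, 1]."""
    rng = np.random.default_rng(0)
    D, T = Y.shape
    V = np.abs(rng.normal(size=(D, n_events)))                 # event vectors
    E = np.clip(np.abs(rng.normal(size=(n_events, T))), 0, 1)  # event functions
    for _ in range(n_iter):
        V *= (Y @ E.T) / (V @ E @ E.T + eps)   # multiplicative update for V
        E *= (V.T @ Y) / (V.T @ V @ E + eps)   # multiplicative update for E
        E = np.clip(E, 0.0, 1.0)               # enforce the [0, 1] range
    return V, E

# Toy usage on stand-in non-negative parameter tracks.
rng = np.random.default_rng(7)
Y = np.abs(rng.normal(size=(10, 80)))
V, E = nn_temporal_decomposition(Y, n_events=5)
print(round(np.linalg.norm(Y - V @ E) / np.linalg.norm(Y), 3))  # relative error
```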
Citations: 8