
Latest publications in IEEE Transactions on Audio Speech and Language Processing

A Free-Source Method (FrSM) for Calibrating a Large-Aperture Microphone Array
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2256896
Sarthak Khanal, H. Silverman, Rahul R. Shakya
Large-aperture microphone arrays can be used to capture and enhance speech from individual talkers in noisy, multi-talker, and reverberant environments. However, they must be calibrated, often more than once, to obtain accurate 3-dimensional coordinates for all microphones. Direct-measurement techniques, such as using a measuring tape or a laser-based tool, are cumbersome and time-consuming. Some previous methods that used acoustic signals for array calibration required bulky hardware and/or fixed, known source locations. Others, which allow more flexible source placement, often have issues with real data, have reported results in 2D only, or work only for small arrays. This paper describes a complete and robust method for automatic calibration using acoustic signals that is simple, repeatable, and accurate, and has been shown to work for a real system. The method requires only a single transducer (speaker) with a microphone attached above its center. The unit is freely moved around the focal volume of the microphone array, generating a single long recording from all the microphones. After that, the system is completely automatic. We describe the free-source method (FrSM), validate its effectiveness, and present accuracy results against measured ground truth. The performance of FrSM is compared to that of several other methods on a real 128-microphone array.
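The geometric core of this kind of acoustic calibration can be sketched briefly: given a few known source positions and the source-to-microphone distances recovered from time-of-flight measurements, a linearized least-squares solve yields one microphone's 3-D coordinates. The function name and synthetic data below are illustrative assumptions, not the paper's full FrSM pipeline (which also localizes the moving source automatically).

```python
import numpy as np

def trilaterate_mic(sources, dists):
    """Estimate one microphone's 3-D position from its distances to
    known source points via linearized least squares (hypothetical
    helper, not the paper's algorithm)."""
    s0, d0 = sources[0], dists[0]
    # Subtracting the first range equation from the others cancels the
    # quadratic |x|^2 term, leaving a linear system in the position x.
    A = 2.0 * (sources[1:] - s0)
    b = (d0**2 - dists[1:]**2
         + np.sum(sources[1:]**2, axis=1) - np.sum(s0**2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

# Synthetic check: five source positions around a mic at (1, 2, 0.5).
rng = np.random.default_rng(0)
sources = rng.uniform(-3.0, 3.0, size=(5, 3))
mic = np.array([1.0, 2.0, 0.5])
dists = np.linalg.norm(sources - mic, axis=1)
est = trilaterate_mic(sources, dists)
```

With noise-free distances the linear solve recovers the position exactly; with real time-of-flight estimates, more source points and a nonlinear refinement would be needed.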
Citations: 15
Model-Based Multiple Pitch Tracking Using Factorial HMMs: Model Adaptation and Inference
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260744
Michael Wohlmayr, F. Pernkopf
Robustness against noise and interfering audio signals is one of the challenges in speech recognition and audio analysis technology. One avenue to approach this challenge is single-channel multiple-source modeling. Factorial hidden Markov models (FHMMs) are capable of modeling acoustic scenes with multiple sources interacting over time. While these models reach good performance on specific tasks, there are still serious limitations restricting their applicability in many domains. In this paper, we generalize these models and enhance their applicability. In particular, we develop an EM-like iterative adaptation framework capable of adapting the model parameters to the specific situation (e.g., actual speakers, gain, acoustic channel, etc.) using only speech mixture data; currently, source-specific data is required to learn the model. Inference in FHMMs is an essential ingredient for adaptation. We develop efficient approaches based on observation likelihood pruning. Both adaptation and efficient inference are empirically evaluated on the task of multipitch tracking using the GRID corpus.
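The idea behind observation-likelihood pruning can be sketched in a few lines: score every joint state of the factorial model by its combined per-chain observation log-likelihood and keep only the best few, so subsequent inference runs over a small surviving subset rather than the full product space. The function and toy numbers below are an illustrative stand-in, not the paper's exact algorithm.

```python
import numpy as np
from itertools import product

def pruned_joint_states(loglik_a, loglik_b, top_k):
    """Rank the joint (factorial) states of two independent chains by
    combined observation log-likelihood and keep the top_k best."""
    scores = {(i, j): loglik_a[i] + loglik_b[j]
              for i, j in product(range(len(loglik_a)),
                                  range(len(loglik_b)))}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Toy example: 3 states per chain give 9 joint states; keep the best 4.
la = np.log(np.array([0.7, 0.2, 0.1]))
lb = np.log(np.array([0.5, 0.4, 0.1]))
kept = pruned_joint_states(la, lb, 4)
```

For two chains with N states each, the full factorial space has N² joint states per frame, so even modest pruning gives a large speedup at a small approximation cost.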
Citations: 11
Applying Multi- and Cross-Lingual Stochastic Phone Space Transformations to Non-Native Speech Recognition
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260150
David Imseng, H. Bourlard, J. Dines, Philip N. Garner, M. Magimai.-Doss
In the context of hybrid HMM/MLP Automatic Speech Recognition (ASR), this paper describes an investigation into a new type of stochastic phone space transformation, which maps “source” phone (or phone HMM state) posterior probabilities (as obtained at the output of a Multilayer Perceptron/MLP) into “destination” phone (HMM phone state) posterior probabilities. The resulting stochastic matrix transformation can be used within the same language to automatically adapt to different phone formats (e.g., IPA) or across languages. Additionally, as shown here, it can also be applied successfully to non-native speech recognition. In the same spirit as MLLR adaptation or MLP adaptation, the approach proposed here directly maps posterior distributions, and is trained by optimizing a Kullback-Leibler-based cost function on a small amount of adaptation data, using a modified version of an iterative EM algorithm. On a non-native English database (HIWIRE), and comparing with multiple setups (monophone and triphone mapping, MLLR adaptation), we show that the resulting posterior mapping yields state-of-the-art results using very limited amounts of adaptation data in mono-, cross-, and multi-lingual setups. We also show that “universal” phone posteriors, trained on a large amount of multilingual data, can be transformed to English phone posteriors, resulting in an ASR system that significantly outperforms a system trained on English data only. Finally, we demonstrate that the proposed approach outperforms alternative data-driven, as well as knowledge-based, mapping techniques.
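The central object, a stochastic matrix that maps source phone posteriors to destination phone posteriors, can be sketched directly. The 3×3 matrix and the frame posterior below are made-up illustrations; in the paper, the transformation is trained on adaptation data with a Kullback-Leibler-based criterion rather than set by hand.

```python
import numpy as np

# Row-stochastic transformation: T[i, j] ~ P(dest phone j | source phone i).
# Values are invented for illustration only.
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])

# One frame of "source" phone posteriors, e.g. an MLP output.
src_post = np.array([0.6, 0.3, 0.1])

# Mapped "destination" posteriors. Because each row of T sums to 1, the
# result is again a proper probability distribution over phones.
dst_post = src_post @ T
```

The same matrix multiplication is applied frame by frame, so the mapping adds negligible cost on top of the MLP forward pass.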
Citations: 8
Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2256894
Anoop Deoras, Gökhan Tür, R. Sarikaya, Dilek Z. Hakkani-Tür
Most Spoken Language Understanding (SLU) systems today employ a cascade approach, where the best hypothesis from the Automatic Speech Recognizer (ASR) is fed into understanding modules such as slot sequence classifiers and intent detectors. The output of these modules is then further fed into downstream components such as an interpreter and/or knowledge broker. These statistical models are usually trained individually to optimize the error rate of their respective output. In such approaches, errors from one module irreversibly propagate into other modules, causing a serious degradation in the overall performance of the SLU system. Thus it is desirable to optimize all the statistical models jointly. As a first step towards this, in this paper, we propose a joint decoding framework in which we predict the optimal word sequence as well as the slot sequence (semantic tag sequence) jointly, given the input acoustic stream. Furthermore, the improved recognition output is then used for an utterance classification task; specifically, we focus on the intent detection task. On an SLU task, we show a 1.5% absolute reduction (7.6% relative reduction) in word error rate (WER) and a 1.2% absolute improvement in F measure for slot prediction when compared to a very strong cascade baseline comprising a state-of-the-art large-vocabulary ASR followed by a conditional random field (CRF) based slot sequence tagger. Similarly, for intent detection, we show a 1.2% absolute reduction (12% relative reduction) in classification error rate.
Citations: 20
Robust Log-Energy Estimation and its Dynamic Change Enhancement for In-car Speech Recognition
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260151
Weifeng Li, Longbiao Wang, Yicong Zhou, H. Bourlard, Q. Liao
The log-energy parameter, typically derived from a full-band spectrum, is a critical feature commonly used in automatic speech recognition (ASR) systems. However, log-energy is difficult to estimate reliably in the presence of background noise. In this paper, we theoretically show that background noise affects the trajectories of not only the “conventional” log-energy, but also its delta parameters. This results in a poor estimation of the actual log-energy and its delta parameters, which no longer describe the speech signal. We thus propose a new method to estimate log-energy from a sub-band spectrum, followed by dynamic change enhancement and mean smoothing. We demonstrate the effectiveness of the proposed log-energy estimation and its post-processing steps through speech recognition experiments conducted on the in-car CENSREC-2 database. The proposed log-energy (together with its corresponding delta parameters) yields an average improvement of 32.8% compared with the baseline front-ends. Moreover, it is also shown that further improvement can be achieved by incorporating the new Mel-Frequency Cepstral Coefficients (MFCCs) obtained by non-linear spectral contrast stretching.
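Two of the ingredients above can be sketched compactly: a log-energy computed from a sub-band of the power spectrum rather than the full band, and the standard regression-based delta used to capture its dynamic change. The bin range and window width below are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

def subband_log_energy(power_spec, lo_bin, hi_bin, floor=1e-10):
    """Per-frame log-energy over the sub-band [lo_bin, hi_bin) of a
    (frames x bins) power spectrum; the bin range is an assumed choice."""
    band = power_spec[:, lo_bin:hi_bin].sum(axis=1)
    return np.log(np.maximum(band, floor))

def delta(feat, w=2):
    """Standard regression-based delta over a +/- w frame window."""
    pad = np.pad(feat, w, mode='edge')
    num = sum(k * (pad[w + k:len(feat) + w + k]
                   - pad[w - k:len(feat) + w - k])
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))
```

As a sanity check, a frame sequence whose log-energy rises linearly yields a constant delta equal to the slope, which is the "dynamic change" the paper enhances.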
Citations: 4
Study of the General Kalman Filter for Echo Cancellation
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2245654
C. Paleologu, J. Benesty, S. Ciochină
The Kalman filter is a very interesting signal processing tool, which is widely used in many practical applications. In this paper, we study the Kalman filter in the context of echo cancellation. The contribution of this work is threefold. First, we derive a different form of the Kalman filter by considering, at each iteration, a block of time samples instead of one time sample, as is the case in the conventional approach. Second, we show how this general Kalman filter (GKF) is connected with some of the most popular adaptive filters for echo cancellation, i.e., the normalized least-mean-square (NLMS) algorithm, the affine projection algorithm (APA), and its proportionate version (PAPA). Third, a simplified Kalman filter is developed in order to reduce the computational load of the GKF; this algorithm behaves like a variable step-size adaptive filter. Simulation results indicate the good performance of the proposed algorithms, which can be attractive choices for echo cancellation.
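For context, the NLMS algorithm that the GKF is shown to be connected with can be sketched in a few lines: an adaptive FIR filter estimates the echo from the far-end signal and subtracts it from the microphone signal. The synthetic echo path and signals below are made up for illustration.

```python
import numpy as np

def nlms_echo_canceller(x, d, L=32, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: x is the far-end (loudspeaker)
    signal, d the microphone signal; returns the error e = d - y,
    i.e. the microphone signal with the estimated echo removed."""
    h = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(L - 1, len(x)):
        u = x[n - L + 1:n + 1][::-1]          # L most recent far-end samples
        y = h @ u                             # current echo estimate
        e[n] = d[n] - y
        h += mu * e[n] * u / (u @ u + eps)    # normalized gradient step
    return e

# Synthetic setup: white-noise far-end signal through a short made-up
# echo path; with no near-end speech, the residual e should shrink to ~0.
rng = np.random.default_rng(1)
x = rng.standard_normal(4000)
echo_path = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, echo_path)[:len(x)]
e = nlms_echo_canceller(x, d, L=8)
```

The per-sample normalized update here is exactly the degenerate case the paper generalizes: processing a block of time samples per iteration leads to APA-like updates, and the Kalman formulation supplies a principled, variable step size.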
Citations: 92
Effective Model Representation by Information Bottleneck Principle
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2253097
Ron M. Hecht, Elad Noor, Gil Dobry, Y. Zigel, Aharon Bar-Hillel, Naftali Tishby
The common approaches to feature extraction in speech processing are generative and parametric, although they are highly sensitive to violations of their model assumptions. Here, we advocate the non-parametric Information Bottleneck (IB). IB is an information-theoretic approach that extends minimal sufficient statistics. However, unlike minimal sufficient statistics, which do not allow any relevant data loss, the IB method enables a principled tradeoff between compactness and the amount of target-related information. IB's ability to improve a broad range of recognition tasks is illustrated on model dimension reduction for speaker recognition and model clustering for age-group verification.
Citations: 4
Analysis and Design of Multichannel Systems for Perceptual Sound Field Reconstruction
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260152
E. D. Sena, H. Hacıhabiboğlu, Z. Cvetković
This paper presents a systematic framework for the analysis and design of circular multichannel surround sound systems. Objective analysis based on the concept of active intensity fields shows that, for stable rendition of monochromatic plane waves, it is beneficial to render each such wave with no more than two channels. Based on that finding, we propose a methodology for the design of circular microphone arrays, in the same configuration as the corresponding loudspeaker system, which aims to capture the inter-channel time and intensity differences that ensure accurate rendition of the auditory perspective. The methodology is applicable to regular and irregular microphone/speaker layouts, and a wide range of microphone array radii, including the special case of coincident arrays, which corresponds to intensity-based systems. Several design examples, involving first- and higher-order microphones, are presented. Results of formal listening tests suggest that the proposed design methodology achieves a performance comparable to prior art in the center of the loudspeaker array and a more graceful degradation away from the center.
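Rendering a plane wave with no more than two channels can be illustrated with the classical tangent panning law, which fixes the inter-channel intensity ratio for a loudspeaker pair at azimuths ±phi0. This is a generic textbook sketch, not the paper's specific design procedure.

```python
import numpy as np

def tangent_pan(phi, phi0):
    """Power-normalized tangent-law gains for a loudspeaker pair at
    +/- phi0, rendering a source at azimuth phi (|phi| <= phi0)."""
    t = np.tan(phi) / np.tan(phi0)        # tangent law: t = (gl-gr)/(gl+gr)
    gl, gr = 1.0 + t, 1.0 - t
    norm = np.hypot(gl, gr)               # normalize so gl^2 + gr^2 = 1
    return gl / norm, gr / norm

# A centered source splits power equally between the pair; a source on
# a loudspeaker direction is rendered by that loudspeaker alone.
gc = tangent_pan(0.0, np.pi / 6)
ge = tangent_pan(np.pi / 6, np.pi / 6)
```

Restricting each plane wave to one active pair in this way is what keeps the rendered intensity vector stable as the source direction moves around the circle.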
Citations: 23
Musical Instrument Sound Morphing Guided by Perceptually Motivated Features
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2260154
Marcelo F. Caetano, X. Rodet
Sound morphing is a transformation that gradually blurs the distinction between the source and target sounds. For musical instrument sounds, the morph must operate across timbre dimensions to create the auditory illusion of hybrid musical instruments. The ultimate goal of sound morphing is to perform perceptually linear transitions, which requires an appropriate model to represent the sounds being morphed and an interpolation function to obtain intermediate sounds. Typically, morphing techniques directly interpolate the parameters of the sound model without considering the perceptual impact or evaluating the results. Perceptual evaluations are cumbersome and not always conclusive. In this work, we seek parameters of a sound model that favor linear variation of the perceptually motivated temporal and spectral features used to guide the morph towards more perceptually linear results. The requirement of linear variation of feature values gives rise to objective evaluation criteria for sound morphing. We investigate several spectral envelope morphing techniques to determine which spectral representation renders the most linear transformation in the spectral shape feature domain. We find that interpolation of line spectral frequencies gives the most linear spectral envelope morphs. Analogously, we study temporal envelope morphing techniques and conclude that interpolation of cepstral coefficients results in the most linear temporal envelope morph.
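Why cepstral interpolation behaves linearly can be shown in a short sketch: a log-magnitude envelope is a linear function of the cepstral coefficients, so interpolating the coefficients interpolates the envelope linearly as well. The FFT size and toy coefficients are arbitrary choices, and the sketch is shown on a spectral log-magnitude envelope for simplicity; the paper applies cepstral interpolation to the temporal envelope and uses line spectral frequencies for the spectral one.

```python
import numpy as np

def morph_envelope(cep_src, cep_tgt, alpha, n_fft=64):
    """Interpolate real-cepstrum coefficients and return the morphed
    log-magnitude envelope (illustrative helper, not the paper's code)."""
    cep = (1.0 - alpha) * cep_src + alpha * cep_tgt
    # Place the low-quefrency cepstrum in an even-symmetric buffer so
    # its DFT is the real, smooth log-magnitude envelope.
    buf = np.zeros(n_fft)
    buf[:len(cep)] = cep
    buf[-(len(cep) - 1):] += cep[1:][::-1]
    return np.fft.rfft(buf).real

# Toy source/target cepstra and the halfway morph between them.
c_src = np.array([0.0, 1.0, 0.5])
c_tgt = np.array([1.0, -1.0, 0.2])
mid = morph_envelope(c_src, c_tgt, 0.5)
```

Because the DFT is linear, the halfway morph is exactly the average of the two endpoint envelopes, which is the kind of feature-domain linearity the paper's objective criteria reward.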
Marcelo F. Caetano and X. Rodet, "Musical Instrument Sound Morphing Guided by Perceptually Motivated Features," IEEE Transactions on Audio Speech and Language Processing, vol. 21, pp. 1666-1675, Aug. 2013. DOI: 10.1109/TASL.2013.2260154
Citations: 23
Joint Source-Filter Optimization for Accurate Vocal Tract Estimation Using Differential Evolution
Pub Date : 2013-08-01 DOI: 10.1109/TASL.2013.2255275
O. Schleusing, T. Kinnunen, B. Story, J. Vesin
In this work, we present a joint source-filter optimization approach for separating voiced speech into vocal tract (VT) and voice source components. The presented method is pitch-synchronous and thereby exhibits a high robustness against vocal jitter, shimmer and other glottal variations while covering various voice qualities. The voice source is modeled using the Liljencrants-Fant (LF) model, which is integrated into a time-varying auto-regressive speech production model with exogenous input (ARX). The non-convex optimization problem of finding the optimal model parameters is addressed by a heuristic, evolutionary optimization method called differential evolution. The optimization method is first validated in a series of experiments with synthetic speech. Estimated glottal source and VT parameters are the criteria used for comparison with the iterative adaptive inverse filter (IAIF) method and the linear prediction (LP) method under varying conditions such as jitter, fundamental frequency (f0) as well as environmental and glottal noise. The results show that the proposed method largely reduces the bias and standard deviation of estimated VT coefficients and glottal source parameters. Furthermore, the performance of the source-filter separation is evaluated in experiments using speech generated with a physical model of speech production. The proposed method reliably estimates glottal flow waveforms and lower formant frequencies. Results obtained for higher formant frequencies indicate that research on more accurate voice source models and their interaction with the VT is necessary to improve the source-filter separation. The proposed optimization approach promises to be a useful tool for future research addressing this topic.
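The differential evolution scheme the authors rely on is itself compact. The sketch below shows a generic DE/rand/1/bin minimizer applied to a toy quadratic cost; it is a stand-in illustrating the optimizer's structure (mutation, binomial crossover, greedy selection), not the paper's ARX-LF objective, and all parameter values are illustrative defaults.

```python
import numpy as np

def de_rand_1_bin(cost, bounds, pop_size=20, F=0.7, CR=0.9,
                  generations=200, seed=0):
    """Minimal DE/rand/1/bin minimizer over box-constrained parameters."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = len(lo)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fitness = np.array([cost(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct members other than i.
            r1, r2, r3 = rng.choice(
                [j for j in range(pop_size) if j != i], size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), lo, hi)
            # Binomial crossover, forcing at least one gene from the mutant.
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Greedy selection: keep the trial only if it is no worse.
            f_trial = cost(trial)
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    best = int(np.argmin(fitness))
    return pop[best], fitness[best]

# Toy usage: recover the minimizer of a shifted quadratic.
best_x, best_f = de_rand_1_bin(
    lambda x: float(np.sum((x - np.array([1.0, -2.0])) ** 2)),
    bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Because DE needs only cost-function evaluations (no gradients), it suits the non-convex joint source-filter objective described in the abstract; in the paper's setting the candidate vector would hold the LF source and ARX filter parameters.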
O. Schleusing, T. Kinnunen, B. Story, and J. Vesin, "Joint Source-Filter Optimization for Accurate Vocal Tract Estimation Using Differential Evolution," IEEE Transactions on Audio Speech and Language Processing, vol. 21, pp. 1560-1572, Aug. 2013. DOI: 10.1109/TASL.2013.2255275
Citations: 14
Journal: IEEE Transactions on Audio Speech and Language Processing