
IEEE Trans. Speech Audio Process.: Latest Publications

Higher Order Cepstral Moment Normalization for Improved Robust Speech Recognition
Pub Date : 2009-02-01 DOI: 10.1109/TASL.2008.2006575
C. Hsu, Lin-Shan Lee
Cepstral normalization has widely been used as a powerful approach to produce robust features for speech recognition. Good examples of this approach include cepstral mean subtraction, and cepstral mean and variance normalization, in which either the first or both the first and the second moments of the Mel-frequency cepstral coefficients (MFCCs) are normalized. In this paper, we propose the family of higher order cepstral moment normalization, in which the MFCC parameters are normalized with respect to a few moments of orders higher than 1 or 2. The basic idea is that the higher order moments are more dominated by samples with larger values, which are very likely the primary sources of the asymmetry and abnormal flatness or tail size of the parameter distributions. Normalization with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric with more reasonable flatness and tail size. The fundamental principles behind this approach are also analyzed and discussed based on the statistical properties of the distributions of the MFCC parameters. Experimental results based on the AURORA 2, AURORA 3, AURORA 4, and Resource Management (RM) testing environments show that with the proposed approach, recognition accuracy can be significantly and consistently improved for all types of noise and all SNR conditions.
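To make the idea concrete, here is a minimal sketch of normalizing a cepstral coefficient sequence with respect to a higher-order moment. It assumes one centers the sequence and then rescales it so that its p-th absolute central moment matches a target value; the paper's exact HOCMN formulation (which moments, applied jointly or separately) may differ.

```python
import numpy as np

def moment_normalize(c, order=3, target=1.0):
    """Normalize a 1-D cepstral coefficient sequence so that its
    `order`-th absolute central moment equals `target`.
    Illustrative sketch only, not the paper's exact HOCMN recipe."""
    c = np.asarray(c, dtype=float)
    c = c - c.mean()                       # first-moment (mean) normalization
    m = np.mean(np.abs(c) ** order)        # order-th absolute central moment
    scale = (target / m) ** (1.0 / order)  # rescale so the moment hits target
    return c * scale
```

With `order=2` this reduces to ordinary variance normalization; higher orders weight large-magnitude samples more heavily, which is the point made in the abstract.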
Pages: 205-220
Citations: 30
Cascaded RLS-LMS Prediction in MPEG-4 Lossless Audio Coding
Pub Date : 2008-03-01 DOI: 10.1109/TASL.2007.911675
Haibin Huang, P. Fränti, Dong-Yan Huang, S. Rahardja
This paper describes the cascaded recursive least square-least mean square (RLS-LMS) prediction, which is part of the recently published MPEG-4 Audio Lossless Coding international standard. The predictor consists of cascaded stages of simple linear predictors, with the prediction error at the output of one stage passed to the next stage as the input signal. A linear combiner adds up the intermediate estimates at the output of each prediction stage to give a final estimate of the RLS-LMS predictor. In the RLS-LMS predictor, the first prediction stage is a simple first-order predictor with a fixed coefficient value 1. The second prediction stage uses the recursive least square algorithm to adaptively update the predictor coefficients. The subsequent prediction stages use the normalized least mean square algorithm to update the predictor coefficients. The coefficients of the linear combiner are then updated using the sign-sign least mean square algorithm. For stereo audio signals, the RLS-LMS predictor uses both intrachannel prediction and interchannel prediction, which results in a 3% improvement in compression ratio over using only the intrachannel prediction. Through extensive tests, the MPEG-4 Audio Lossless coder using the RLS-LMS predictor has demonstrated a compression ratio that is on par with the best lossless audio coders in the field. In this paper, the structure of the RLS-LMS predictor is described in detail, and the optimal predictor configuration is studied through various experiments.
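The cascade structure described above can be sketched as follows. This toy version keeps only the fixed first-order stage (coefficient 1, i.e., predicting the previous sample) followed by one NLMS stage on its residual; the standard's predictor also contains an RLS stage and a sign-sign LMS combiner, omitted here, and the filter order and step size are illustrative choices.

```python
import numpy as np

def nlms_predict(x, order=4, mu=0.5, eps=1e-8):
    """One NLMS linear-prediction stage: returns the residual
    e[n] = x[n] - w . [x[n-1], ..., x[n-order]]."""
    w = np.zeros(order)
    e = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        past = x[max(0, n - order):n][::-1]     # most recent sample first
        u = np.zeros(order)
        u[:len(past)] = past
        y = w @ u                               # prediction
        e[n] = x[n] - y                         # prediction error
        w += mu * e[n] * u / (eps + u @ u)      # normalized LMS update
    return e

def cascade_rls_lms_sketch(x):
    """Stage 1: fixed first-order predictor with coefficient 1
    (difference signal); stage 2: NLMS on the stage-1 residual."""
    x = np.asarray(x, dtype=float)
    e1 = np.empty_like(x)
    e1[0] = x[0]
    e1[1:] = x[1:] - x[:-1]                     # predict x[n] by x[n-1]
    return nlms_predict(e1)
```

For a lossless coder, the final residual (plus the adaptively reconstructible predictor state) is what gets entropy-coded; for smooth signals its energy is far below that of the input.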
Pages: 554-562
Citations: 29
Comments on Vocal Tract Length Normalization Equals Linear Transformation in Cepstral Space
Pub Date : 2007-07-01 DOI: 10.1109/TASL.2007.896653
M. Afify, O. Siohan
The bilinear transformation (BT) is used for vocal tract length normalization (VTLN) in speech recognition systems. We prove two properties of the bilinear mapping that motivated the band-diagonal transform proposed in M. Afify and O. Siohan, "Constrained maximum likelihood linear regression for speaker adaptation," in Proc. ICSLP, Beijing, China, Oct. 2000. This is in contrast to the statement in M. Pitz and H. Ney, "Vocal tract length normalization equals linear transformation in cepstral space," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 930-944, September 2005, that the transform of Afify and Siohan was motivated by empirical observations.
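For reference, the first-order all-pass (bilinear) mapping and its induced frequency warping are commonly written as follows; these are the standard textbook forms, not reproduced from the comment itself:

```latex
\hat{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \qquad |\alpha| < 1,
\qquad
\hat{\omega}(\omega) = \omega + 2\arctan\!\left(\frac{\alpha \sin\omega}{1 - \alpha\cos\omega}\right),
```

and the Pitz-Ney result under discussion is that warping the cepstrum by this map acts as a linear transformation, \(\hat{c} = A(\alpha)\, c\), on the cepstral coefficient vector.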
Pages: 1731-1732
Citations: 11
Generalized Lempel-Ziv Compression for Audio
Pub Date : 2007-02-01 DOI: 10.1109/TASL.2006.881687
D. Kirovski, Zeph Landau
We introduce a novel compression paradigm that generalizes a class of Lempel-Ziv algorithms to lossy compression of multimedia. Based upon the fact that music, in particular electronically generated sound, has a substantial level of repetitiveness within a single clip, we generalize the basic Lempel-Ziv compression algorithm to represent a single window of audio using a linear combination of filtered past windows. In this position paper, we present a detailed overview of the new lossy compression paradigm, identify the basic challenges such as similarity search, and present preliminary experimental results on a benchmark of electronically generated musical pieces.
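The core step, representing a window as a linear combination of past windows, can be sketched as below. This is a toy least-squares version under the assumption that similarity search is done by cosine similarity over raw windows; the paper's scheme additionally filters the past windows and codes the residual, which is omitted here.

```python
import numpy as np

def encode_window(window, past_windows, k=2):
    """Approximate `window` by a least-squares combination of the k most
    similar past windows. Toy sketch of the generalized-LZ idea only."""
    P = np.asarray(past_windows, dtype=float)   # shape (n_past, win_len)
    w = np.asarray(window, dtype=float)
    # cosine similarity between the current window and each past window
    sims = P @ w / (np.linalg.norm(P, axis=1) * np.linalg.norm(w) + 1e-12)
    idx = np.argsort(sims)[-k:]                 # indices of the k best matches
    B = P[idx].T                                # basis, shape (win_len, k)
    coef, *_ = np.linalg.lstsq(B, w, rcond=None)  # least-squares weights
    residual = w - B @ coef                     # what a real codec would code
    return idx, coef, residual
```

The encoder would then transmit the match indices, the quantized weights, and the (hopefully small) residual.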
Pages: 509-518
Citations: 3
Multiple change-point audio segmentation and classification using an MDL-based Gaussian model
Pub Date : 2006-12-01 DOI: 10.1109/TSA.2005.852988
Chung-Hsien Wu, Chia-Hsin Hsieh
This study presents an approach for segmenting and classifying an audio stream based on audio type. First, a silence deletion procedure is employed to remove silence segments in the audio stream. A minimum description length (MDL)-based Gaussian model is then proposed to statistically characterize the audio features. Audio segmentation segments the audio stream into a sequence of homogeneous subsegments using the MDL-based Gaussian model. A hierarchical threshold-based classifier is then used to classify each subsegment into different audio types. Finally, a heuristic method is adopted to smooth the subsegment sequence and provide the final segmentation and classification results. Experimental results indicate that for TDT-3 news broadcast, a missed detection rate (MDR) of 0.1 and a false alarm rate (FAR) of 0.14 were achieved for audio segmentation. Given the same MDR and FAR values, segment-based audio classification achieved a better classification accuracy of 88% compared to a clip-based approach.
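A minimal single-change-point version of MDL-based Gaussian segmentation can be sketched as follows. It assumes a description length of the form "ML Gaussian negative log-likelihood plus a log(n) penalty per parameter plus a cost for the split index"; the paper's multiple-change-point algorithm and feature set are more elaborate.

```python
import numpy as np

def gaussian_nll(x):
    """Negative log-likelihood of x under its ML Gaussian fit."""
    var = np.var(x) + 1e-12
    return 0.5 * len(x) * np.log(2 * np.pi * var) + 0.5 * len(x)

def best_split_mdl(x, min_len=20):
    """Return the split point minimizing total description length
    (two-segment NLL + parameter penalties), or None if keeping one
    Gaussian is cheaper. Illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    no_split = gaussian_nll(x) + 0.5 * 2 * np.log(n)   # 2 params: mean, var
    best, best_cost = None, no_split
    for t in range(min_len, n - min_len):
        cost = (gaussian_nll(x[:t]) + gaussian_nll(x[t:])
                + 0.5 * 4 * np.log(n)                  # 4 params total
                + np.log(n))                           # cost of the split index
        if cost < best_cost:
            best, best_cost = t, cost
    return best
```

Applying this recursively to each resulting segment yields a simple multiple-change-point segmenter.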
Pages: 647-657
Citations: 47
Corrections to "Automatic Transcription of Conversational Telephone Speech"
Pub Date : 2006-12-01 DOI: 10.1109/TASL.2006.871051
T. Hain, P. Woodland, G. Evermann, M. Gales, Xunying Liu, G. Moore, Daniel Povey, Lan Wang
Manuscript received December 9, 2003; revised August 9, 2004. This work was supported by GCHQ and by DARPA under Grant MDA972-02-0013. This paper does not necessarily reflect the position or the policy of the U.S. Government and no official endorsement should be inferred. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Geoffrey Zweig. The authors are with the Cambridge University Engineering Department, Cambridge CB2 1PZ, U.K. (e-mail: pcw@eng.cam.ac.uk). Digital Object Identifier 10.1109/TASL.2006.871051
Pages: 727
Citations: 1
Corrections to "Segmental minimum Bayes-risk decoding for automatic speech recognition"
Pub Date : 2006-12-01 DOI: 10.1109/TSA.2005.854087
V. Goel, Shankar Kumar, W. Byrne
The purpose of this paper is to correct and expand upon the experimental results presented in our recently published paper [1]. In [1, Sec. III-B], we present a risk-based lattice cutting (RLC) procedure to segment ASR word lattices into sequences of smaller sublattices. The purpose of this procedure is to restructure the original lattice to improve the efficiency of minimum Bayes-risk (MBR) and other lattice rescoring procedures. Given that the segmented lattices are to be rescored, it is crucial that no paths from the original lattice be lost in the segmentation process. In the experiments reported in our original publication, some of the original paths were inadvertently discarded from the segmented lattices. This affected the performance of the MBR results presented. In this paper, we briefly review the segmentation algorithm and explain the flaw in our previous experiments. We find consistent minor improvements in word error rate (WER) under the corrected procedure. More importantly, we report experiments confirming that the lattice segmentation procedure does indeed preserve all the paths in the original lattice.
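The MBR decoding criterion at the heart of this correction can be illustrated on an N-best list: pick the hypothesis minimizing expected word error under the posterior. This is a toy stand-in for the paper's lattice-based segmental procedure, which scores whole lattices rather than N-best lists.

```python
def edit_distance(a, b):
    """Classic Levenshtein dynamic program over word sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                     # deletion
                          d[i][j - 1] + 1,                     # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # subst.
    return d[m][n]

def mbr_decode(nbest):
    """N-best MBR decoding: `nbest` is a list of (hypothesis, score) pairs;
    scores are normalized into a posterior, and the hypothesis with minimum
    expected edit distance (word-error risk) is returned."""
    total = sum(p for _, p in nbest)
    def risk(h):
        return sum(p / total * edit_distance(h, r) for r, p in nbest)
    return min((h for h, _ in nbest), key=risk)
```

Note that the MBR choice can differ from the MAP choice: a lower-posterior hypothesis that agrees with many others on most words may carry lower expected error.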
Pages: 356-357
Citations: 1
Aggregate a posteriori linear regression adaptation
Pub Date : 2006-12-01 DOI: 10.1109/TSA.2005.860847
Jen-Tzung Chien, Chih-Hsien Huang
We present a new discriminative linear regression adaptation algorithm for hidden Markov model (HMM) based speech recognition. The cluster-dependent regression matrices are estimated from speaker-specific adaptation data by maximizing the aggregate a posteriori probability, which can be expressed as a classification error function that adopts the logarithm of the posterior distribution as the discriminant function. Accordingly, aggregate a posteriori linear regression (AAPLR) is developed for discriminative adaptation in which the classification errors of the adaptation data are minimized. Because the prior distribution of the regression matrix is involved, AAPLR is equipped with Bayesian learning capability. We demonstrate that the difference between AAPLR discriminative adaptation and maximum a posteriori linear regression (MAPLR) adaptation lies in the treatment of the evidence. Unlike minimum classification error linear regression (MCELR), AAPLR has a closed-form solution, enabling rapid adaptation. Experimental results reveal that AAPLR speaker adaptation does improve speech recognition performance at moderate computational cost compared to maximum likelihood linear regression (MLLR), MAPLR, MCELR, and conditional maximum likelihood linear regression (CMLLR). These results are verified for both supervised and unsupervised adaptation with different amounts of adaptation data.
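The shared machinery of this family of methods is an affine regression transform W = [A b] applied to the Gaussian means. A minimal least-squares sketch of estimating such a transform is shown below; AAPLR's contribution, the discriminative aggregate-posterior objective and the prior on W, is deliberately omitted, so this corresponds only to the generic linear-regression form.

```python
import numpy as np

def estimate_affine_transform(means, targets):
    """Least-squares estimate of W = [A b] such that
    targets[i] ~= A @ means[i] + b, the linear-regression form shared by
    MLLR/MAPLR/AAPLR-style adaptation. Sketch only."""
    means = np.asarray(means, dtype=float)
    targets = np.asarray(targets, dtype=float)
    X = np.hstack([means, np.ones((len(means), 1))])  # extended means [mu; 1]
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)   # solves X @ W ~= targets
    return W.T                                        # shape (d, d + 1)
```

In a real system the rows of `targets` would come from sufficient statistics of the adaptation data weighted by state occupancies, not from plain pairs of vectors.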
Pages: 797-807
Citations: 14
Objective Assessment of Speech and Audio Quality - Technology and Applications
Pub Date : 2006-11-01 DOI: 10.1109/TASL.2006.883260
A. Rix, J. Beerends, Doh-Suk Kim, P. Kroon, O. Ghitza
In the past few years, objective quality assessment models have been increasingly used for assessing or monitoring speech and audio quality. By measuring perceived quality on an easily understood subjective scale, such as listening quality (excellent, good, fair, poor, bad), these methods provide a quick and repeatable way to estimate customer experience. Typical applications include audio quality evaluation, selection of codecs or other equipment, and measuring the quality of telephone networks. To introduce this special issue, this paper provides an overview of the field, outlining the main approaches to intrusive, nonintrusive, and parametric models and discussing some of their limitations and areas of future work.
Pages: 1890-1901
Citations: 137
Introduction to the Special Issue on Data Mining of Speech, Audio, and Dialog
Pub Date : 2005-08-15 DOI: 10.1109/TSA.2005.852677
M. Gilbert, Roger K. Moore, G. Zweig
Data mining is concerned with the science, technology, and engineering of discovering patterns and extracting potentially useful or interesting information automatically or semi-automatically from data. Data mining was introduced in the 1990s and has deep roots in the fields of statistics, artificial intelligence, and machine learning. With the advent of inexpensive storage space and faster processing over the past decade or so, data mining research has started to penetrate new grounds in areas of speech and audio processing as well as spoken language dialog. It has been fueled by the influx of audio data that are becoming more widely available from a variety of multimedia sources including webcasts, conversations, music, meetings, voice messages, lectures, television, and radio. Algorithmic advances in automatic speech recognition have also been a major, enabling technology behind the growth in data mining. Current state-of-the-art, large-vocabulary, continuous speech recognizers are now trained on a record amount of data: several hundreds of millions of words and thousands of hours of speech. Pioneering research in robust speech processing, large-scale discriminative training, finite state automata, and statistical hidden Markov modeling have resulted in real-time recognizers that are able to transcribe spontaneous speech with a word accuracy exceeding 85%. With this level of accuracy, the technology is now highly attractive for a variety of speech mining applications. Speech mining research includes many ways of applying machine learning, speech processing, and language processing algorithms to benefit and serve commercial applications. It also raises and addresses several new and interesting fundamental research challenges in the areas of prediction, search, explanation, learning, and language understanding.
These basic challenges are becoming increasingly important in revolutionizing business processes by providing essential sales and marketing information about services, customers, and product offerings. They are also enabling a new class of learning systems to be created that can infer knowledge and trends automatically from data, analyze and report application performance, and adapt and improve over time with minimal or zero human involvement. Effective techniques for mining speech, audio, and dialog data can impact numerous business and government applications. The technology for monitoring conversational speech to discover patterns, capture useful trends, and generate alarms is essential for intelligence and law enforcement organizations as well as for enhancing call center operation. It is useful for an
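The "word accuracy exceeding 85%" figure quoted above corresponds to a word error rate (WER) below 15%. As a quick illustration of how that metric is conventionally computed (WER = (substitutions + deletions + insertions) / reference word count, via a word-level Levenshtein alignment; the function name and toy sentences here are my own, not from the editorial):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed by word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,              # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# one deletion out of six reference words: WER = 1/6, i.e. ~83% word accuracy
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why "word accuracy" (1 − WER) is the figure usually reported.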
M. Gilbert, R. K. Moore, and G. Zweig, "Introduction to the Special Issue on Data Mining of Speech, Audio, and Dialog," IEEE Trans. Speech Audio Process., pp. 633–634, 2005, doi: 10.1109/TSA.2005.852677.
Citations: 1