IEEE Transactions on Audio Speech and Language Processing: Latest Articles

Improving Graph-Based Dependency Parsing Models With Dependency Language Models
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2273715
Min Zhang, Wenliang Chen, Xiangyu Duan, Rong Zhang
For graph-based dependency parsing, how to enrich high-order features without increasing decoding complexity is a very challenging problem. To solve this problem, this paper presents an approach to representing high-order features for graph-based dependency parsing models using a dependency language model and beam search. First, we use a baseline parser to parse a large amount of unannotated data. Then we build the dependency language model (DLM) on the auto-parsed data. A set of new features is represented based on the DLM. Finally, we integrate the DLM-based features into the parsing model during decoding by beam search. We also utilize the features in bilingual text (bitext) parsing models. The main advantages of our approach are: 1) we utilize rich high-order features defined over a view of large scope and an additional large raw corpus; 2) our approach does not increase the decoding complexity. We evaluate the proposed approach on the monotext and bitext parsing tasks. In the monotext parsing task, we conduct the experiments on Chinese and English data. The experimental results show that our new parser achieves the best accuracy on the Chinese data and accuracy comparable to the best known systems on the English data. In the bitext parsing task, we conduct the experiments on Chinese-English bilingual data and our score is the best reported so far.
Citations: 4
Acoustic Modeling With Hierarchical Reservoirs
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2280209
Fabian Triefenbach, A. Jalalvand, Kris Demuynck, J. Martens
Accurate acoustic modeling is an essential requirement of a state-of-the-art continuous speech recognizer. The Acoustic Model (AM) describes the relation between the observed speech signal and the non-observable sequence of phonetic units uttered by the speaker. Nowadays, most recognizers use Hidden Markov Models (HMMs) in combination with Gaussian Mixture Models (GMMs) to model the acoustics, but neural-based architectures are on the rise again. In this work, the recently introduced Reservoir Computing (RC) paradigm is used for acoustic modeling. A reservoir is a fixed - and thus non-trained - Recurrent Neural Network (RNN) that is combined with a trained linear model. This approach combines the ability of an RNN to model the recent past of the input sequence with a simple and reliable training procedure. It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.
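The reservoir idea described in this abstract (a fixed random recurrent network whose states feed a trained linear model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the reservoir size, the spectral radius of 0.9, the ridge-regression readout, and the toy "recall the previous input" task are not taken from the paper, and the names `make_reservoir`, `run_reservoir`, and `train_readout` are hypothetical.

```python
import numpy as np


def make_reservoir(n_in, n_res, spectral_radius=0.9, seed=0):
    """Draw fixed (never trained) input and recurrent weights."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    w_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    # Rescale so the largest absolute eigenvalue equals spectral_radius.
    w_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_res)))
    return w_in, w_res


def run_reservoir(w_in, w_res, inputs):
    """Collect the reservoir state for every input frame."""
    state = np.zeros(w_res.shape[0])
    states = []
    for u in inputs:
        state = np.tanh(w_in @ u + w_res @ state)
        states.append(state)
    return np.array(states)


def train_readout(states, targets, ridge=1e-6):
    """Only this linear map is trained, here by ridge regression."""
    a = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(a, states.T @ targets)


# Toy task (an assumption): output the input seen one step earlier.
rng = np.random.default_rng(1)
u = rng.uniform(-1.0, 1.0, size=(500, 1))
y = np.roll(u, 1, axis=0)

w_in, w_res = make_reservoir(n_in=1, n_res=50)
s = run_reservoir(w_in, w_res, u)
warmup = 50  # discard the initial transient before fitting
w_out = train_readout(s[warmup:], y[warmup:])
mse = float(np.mean((s[warmup:] @ w_out - y[warmup:]) ** 2))
print(f"training MSE: {mse:.6f}")
```

Because the recurrent weights stay fixed, training reduces to a single linear solve, which is the "simple and reliable training procedure" the abstract refers to.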
Citations: 72
Robust Ultra-Low Latency Soft-Decision Decoding of Linear PCM Audio
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2273716
Florian Pflug, T. Fingscheidt
Applications such as professional wireless digital microphones require a transmission of practically uncoded high-quality audio with ultra-low latency on the one hand and robustness to error-prone channels on the other hand. The delay restrictions, however, prohibit the utilization of efficient block or convolutional channel codes for error protection. The contribution of this work is fourfold: We concisely revise and summarize a Bayesian framework for soft-decision audio decoding and present three novel approaches to (almost) latency-free robust decoding of uncompressed audio. Bit reliability information from the transmission channel is exploited, as well as short-term and long-term residual redundancy within the audio signal, and optionally some explicit redundancy in terms of a sample-individual block code. In all cases we utilize variants of higher-order linear prediction to compute prediction probabilities in three novel ways: first, by employing a serial cascade of multiple predictors; second, by exploiting explicit redundancy in the form of parity bits; and third, by utilizing an interpolative forward/backward prediction algorithm. The first two approaches introduce no delay at all, while the third one introduces an ultra-low algorithmic delay of just a few samples. The effectiveness of the proposed algorithms is demonstrated in simulations with BPSK and typical digital microphone FSK modulation schemes on AWGN and bursty fading channels.
Citations: 11
Scalable Speech Coding for IP Networks: Beyond iLBC
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2274694
Koji Seto, T. Ogunfunmi
High quality speech at low bit rates makes code excited linear prediction (CELP) the dominant choice for a narrowband coding technique despite its susceptibility to packet loss. One of the few techniques that received attention after the introduction of the CELP coding technique is the internet low bitrate codec (iLBC), because of its inherently high robustness to packet loss. The addition of rate flexibility and scalability makes the iLBC an attractive choice for voice communication over IP networks. In this paper, performance improvement schemes for multi-rate iLBC and its scalable structure are proposed, and the proposed codec, enhanced from the previous work, is re-designed based on subjective listening quality instead of objective quality. In particular, perceptual weighting and the modified discrete cosine transform (MDCT) with short overlap in the weighted signal domain are employed along with an improved packet loss concealment (PLC) algorithm. The subjective evaluation results show that the speech quality of the proposed codec is equivalent to that of the state-of-the-art codec G.718 under both clean and lossy channel conditions. This result is significant considering that development of the proposed codec is still at an early stage.
Citations: 10
Cross Pattern Coherence Algorithm for Spatial Filtering Applications Utilizing Microphone Arrays
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2277928
Symeon Delikaris-Manias, V. Pulkki
A parametric spatial filtering algorithm with a fixed beam direction is proposed in this paper. The algorithm utilizes the normalized cross-spectral density between signals from microphones of different orders as a criterion for focusing in specific directions. The correlation between microphone signals is estimated in the time-frequency domain. A post-filter is calculated from a multichannel input and is used to assign attenuation values to a coincidentally captured audio signal. The proposed algorithm is simple to implement and offers the capability of coping with interfering sources at different azimuthal locations with or without the presence of diffuse sound. It is implemented by using directional microphones that are placed in the same look direction and have the same magnitude and phase response. Experiments are conducted with simulated and real microphone arrays employing the proposed post-filter and compared to previous coherence-based approaches, such as the McCowan post-filter. A significant improvement is demonstrated in terms of objective quality measures. Formal listening tests conducted to assess the audibility of artifacts of the proposed algorithm in real acoustical scenarios show that no annoying artifacts existed with certain spectral floor values. Examples of the proposed algorithm can be found online at http://www.acoustics.hut.fi/projects/cropac/soundExamples.
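The post-filter idea in this abstract (a normalized cross-spectral density turned into per-bin attenuation values, bounded below by a spectral floor) can be sketched as follows. This is an illustrative assumption, not the paper's exact estimator: the gain formula, the absence of temporal averaging, the floor value of 0.1, and the function name `coherence_gain` are all hypothetical choices.

```python
import numpy as np


def coherence_gain(x_a, x_b, floor=0.1, eps=1e-12):
    """Per-bin gain from the normalized cross-spectrum of two signals.

    x_a, x_b: complex time-frequency bins of two coincident microphone
    signals. In-phase, equal-magnitude bins give a gain near 1;
    out-of-phase bins are clipped to the spectral floor.
    """
    cross = 2.0 * np.real(np.conj(x_a) * x_b)
    power = np.abs(x_a) ** 2 + np.abs(x_b) ** 2 + eps
    return np.clip(cross / power, floor, 1.0)


# Toy bins (assumed values): identical signals vs. sign-flipped signals.
target = np.array([1 + 1j, 2 - 1j, 0.5j])
gain_in_phase = coherence_gain(target, target)
gain_out_of_phase = coherence_gain(target, -target)

# The gain is applied as attenuation to a captured signal.
filtered = gain_out_of_phase * target
print(gain_in_phase, gain_out_of_phase)
```

The floor keeps the attenuation from reaching zero, which is one plausible reading of the "spectral floor values" mentioned in connection with the listening tests.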
Citations: 27
Passive Temporal Offset Estimation of Multichannel Recordings of an Ad-Hoc Microphone Array
Pub Date : 2013-11-01 DOI: 10.1109/TASLP.2013.2286921
Pasi Pertilä, M. Hämäläinen, Mikael Mieskolainen
In recent years ad-hoc microphone arrays have become ubiquitous, and the capture hardware and its quality are increasingly sophisticated. Ad-hoc arrays hold a vast potential for audio applications, but they are inherently asynchronous, i.e., a temporal offset exists in each channel, and furthermore the device locations are generally unknown. Therefore, the data is not directly suitable for traditional microphone array applications such as source localization and beamforming. This work presents a least squares method for temporal offset estimation of a static ad-hoc microphone array. The method utilizes the captured audio content without the need to emit calibration signals, provided that during the recording a sufficient number of sound sources surround the array. The Cramer-Rao lower bound of the estimator is given and the effect of a limited number of surrounding sources on the solution accuracy is investigated. A practical implementation is then presented using non-linear filtering with automatic parameter adjustment. Simulations over a range of reverberation and noise levels demonstrate the algorithm's robustness. Using smartphones, an average RMS error of 3.5 samples (at 48 kHz) was reached when the algorithm's assumptions were met.
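To make the least squares step concrete, here is a minimal sketch under a simplifying assumption: noisy pairwise measurements d_ij ≈ o_i − o_j of the per-channel offsets are already available, and channel 0 is taken as the reference. How the paper obtains its pairwise observations from the audio is not reproduced here, and `estimate_offsets` and all numbers below are hypothetical.

```python
import numpy as np


def estimate_offsets(n_channels, pairwise):
    """Least squares channel offsets from pairwise differences.

    pairwise: iterable of (i, j, d_ij) with d_ij ≈ o_i - o_j in samples.
    Channel 0 is the reference, so its offset is fixed to 0.
    """
    rows, rhs = [], []
    for i, j, d in pairwise:
        row = np.zeros(n_channels)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        rhs.append(d)
    # Dropping column 0 removes the global-shift ambiguity.
    design = np.array(rows)[:, 1:]
    solution, *_ = np.linalg.lstsq(design, np.array(rhs), rcond=None)
    return np.concatenate(([0.0], solution))


# Simulated offsets in samples (assumed values) with unit-variance noise.
true_offsets = np.array([0.0, 120.0, -35.0, 48.0])
rng = np.random.default_rng(0)
pairs = [
    (i, j, true_offsets[i] - true_offsets[j] + rng.normal(0.0, 1.0))
    for i in range(4)
    for j in range(i + 1, 4)
]

estimated = estimate_offsets(4, pairs)
rms_error = float(np.sqrt(np.mean((estimated - true_offsets) ** 2)))
print(estimated, rms_error)
```

Using all channel pairs gives more equations than unknowns, so measurement noise on individual pairs is averaged out in the solution.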
Citations: 36
Second Order Methods for Optimizing Convex Matrix Functions and Sparse Covariance Clustering
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2263142
Gillian M. Chin, J. Nocedal, P. Olsen, Steven J. Rennie
A variety of first-order methods have recently been proposed for solving matrix optimization problems arising in machine learning. The premise for utilizing such algorithms is that second-order information is too expensive to employ, and so simple first-order iterations are likely to be optimal. In this paper, we argue that second-order information is in fact efficiently accessible in many matrix optimization problems, and can be effectively incorporated into optimization algorithms. We begin by reviewing how certain Hessian operations can be conveniently represented in a wide class of matrix optimization problems, and provide the first proofs for these results. Next we consider a concrete problem, namely the minimization of the ℓ1 regularized Jeffreys divergence, and derive formulae for computing Hessians and Hessian vector products. This allows us to propose various second-order methods for solving the Jeffreys divergence problem. We present extensive numerical results illustrating the behavior of the algorithms and apply the methods to a speech recognition problem. We compress full covariance Gaussian mixture models utilized for acoustic models in automatic speech recognition. By discovering clusters of (sparse inverse) covariance matrices, we can compress the number of covariance parameters by a factor exceeding 200, while still outperforming the word error rate (WER) performance of a diagonal covariance model that has 20 times fewer covariance parameters than the original acoustic model.
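For orientation, one common form of the objective named in this abstract can be written out. The notation is an assumption: Σ is a given d × d covariance, P is the inverse covariance being optimized, both Gaussians are taken as zero-mean, the Jeffreys divergence is taken as the sum of the two Kullback-Leibler divergences, and the penalty is the entrywise ℓ1 norm with weight λ. The paper's exact scaling and penalty may differ.

```latex
\begin{align}
J(\Sigma, P)
  &= \mathrm{KL}\!\left(\mathcal{N}(0,\Sigma)\,\middle\|\,\mathcal{N}(0,P^{-1})\right)
   + \mathrm{KL}\!\left(\mathcal{N}(0,P^{-1})\,\middle\|\,\mathcal{N}(0,\Sigma)\right) \\
  &= \tfrac{1}{2}\operatorname{tr}(\Sigma P)
   + \tfrac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}P^{-1}\right) - d, \\
\hat{P}
  &= \arg\min_{P \succ 0}\; J(\Sigma, P) + \lambda \lVert P \rVert_{1}.
\end{align}
```

The log-determinant terms of the two Kullback-Leibler divergences cancel in the sum, which is why only trace terms remain.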
Citations: 3
A Difference of Convex Functions Approach to Large-Scale Log-Linear Model Estimation
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2271592
Theodoros Tsiligkaridis, E. Marcheret, V. Goel
We introduce a new class of parameter estimation methods for log-linear models. Our approach relies on the fact that minimizing a rational function of mixtures of exponentials is equivalent to minimizing a difference of convex functions. This allows us to construct convex auxiliary functions by applying the concave-convex procedure (CCCP). We consider a modification of CCCP where a proximal term is added (ProxCCCP), and extend it further by introducing an ℓ1 penalty. For solving the 'convex + ℓ1' auxiliary problem, we propose an approach called SeqGPSR that is based on sequential application of the GPSR procedure. We present a convergence analysis of the algorithms, including sufficient conditions for convergence to a critical point of the objective function. We propose an adaptive procedure for varying the strength of the proximal regularization term in each ProxCCCP iteration, and show this procedure (AProxCCCP) is effective in practice and stable under some mild conditions. The CCCP procedure and proposed variants are applied to the task of optimizing the cross-entropy objective function for an audio frame classification problem. Class posteriors are modeled using log-linear models consisting of approximately 6 million parameters. Our results show that CCCP variants achieve a much better cross-entropy objective value as compared to direct optimization of the objective function by a first-order gradient-based approach, stochastic gradient descent or the L-BFGS procedure.
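The basic concave-convex procedure mentioned in this abstract can be shown on a scalar toy problem. This sketch is an assumption chosen for clarity, not the paper's log-linear objective: it minimizes f(x) = x^4 − 2x^2, split as the convex u(x) = x^4 minus the convex v(x) = 2x^2, and it omits the proximal term and the ℓ1 penalty. The helper names are hypothetical.

```python
import math


def cccp(argmin_surrogate, grad_v, x0, iterations=30):
    """Plain CCCP: linearize v at the current point, minimize the
    convex surrogate u(x) - grad_v(x_k) * x, and repeat."""
    trajectory = [x0]
    for _ in range(iterations):
        slope = grad_v(trajectory[-1])
        trajectory.append(argmin_surrogate(slope))
    return trajectory


def f(x):
    """Objective f = u - v with u(x) = x**4 and v(x) = 2 * x**2."""
    return x ** 4 - 2.0 * x ** 2


def grad_v(x):
    return 4.0 * x


def argmin_surrogate(slope):
    # Minimizing x**4 - slope * x gives 4 * x**3 = slope.
    return math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)


trajectory = cccp(argmin_surrogate, grad_v, x0=0.2)
values = [f(x) for x in trajectory]
print(trajectory[-1], values[-1])
```

Each surrogate is convex and touches f at the current iterate, so the objective value never increases from one iteration to the next; here the iterates approach the local minimizer x = 1.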
Citations: 7
Large Vocabulary Speech Recognition on Parallel Architectures
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2271591
P. Cardinal, P. Dumouchel, Gilles Boulianne
The speed of modern processors has remained constant over the last few years, but integration capacity continues to follow Moore's law; thus, to be scalable, applications must be parallelized. Parallelizing the classical Viterbi beam search has been shown to be very difficult on multi-core processor architectures and on massively threaded architectures such as graphics processing units (GPUs). The problem with this approach is that active states are scattered in memory and therefore cannot be transferred efficiently to the processor memory. This problem can be circumvented by using the A* search, which uses a heuristic to significantly reduce the number of explored hypotheses. The main advantage of this algorithm is that the processing time is moved from the search in the recognition network to the computation of heuristic costs, which can be designed to take advantage of parallel architectures. Our parallel implementation of the A* decoder on a 4-core processor with a GPU led to a speed-up factor of 6.13 compared to the Viterbi beam search at its maximum capacity and a 4% absolute improvement in accuracy at real time.
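The claim that an A* heuristic reduces the number of explored hypotheses can be illustrated with a small sketch. This is not the decoder described in the abstract: the grid graph, the Manhattan-distance heuristic, and the names `a_star` and `grid_neighbors` are assumptions for illustration only, and nothing here is parallelized. The sketch counts expanded nodes with the heuristic and with a zero heuristic (plain uniform-cost search).

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path, expanded) for a cheapest path, or None if unreachable."""
    frontier = [(heuristic(start), 0, start)]  # entries are (f = g + h, g, node)
    best_g = {start: 0}
    parent = {start: None}
    expanded = 0
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best_g[node]:
            continue  # stale entry: a cheaper route to this node was found later
        expanded += 1
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return g, path[::-1], expanded
        for nxt, step_cost in neighbors(node):
            g2 = g + step_cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                parent[nxt] = node
                heapq.heappush(frontier, (g2 + heuristic(nxt), g2, nxt))
    return None


def grid_neighbors(size):
    """4-connected moves with unit cost on a size x size grid (toy search space)."""
    def neighbors(node):
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size:
                yield (nr, nc), 1
    return neighbors


size, start, goal = 8, (0, 0), (0, 7)


def manhattan(node):
    # Never overestimates the remaining cost on this grid.
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])


cost_h, path_h, expanded_h = a_star(start, goal, grid_neighbors(size), manhattan)
cost_0, path_0, expanded_0 = a_star(start, goal, grid_neighbors(size), lambda n: 0)
print("with heuristic:", cost_h, "cost,", expanded_h, "nodes expanded")
print("zero heuristic:", cost_0, "cost,", expanded_0, "nodes expanded")
```

Both runs return the same path cost, but the run guided by the heuristic expands fewer nodes. In this sketch the work shifts into evaluating `heuristic`, which is the part the abstract says can be designed for parallel hardware.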
Citations: 4
Diffused Sensing for Sharp Directive Beamforming
Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2274695
K. Niwa, Yusuke Hioka, K. Furuya, Y. Haneda
We generalize our previously proposed diffused sensing, a microphone array design for sharp directive beamforming, so that various filter design methods can be applied. For conventional microphone arrays, various filter design methods have been studied to narrow the directivity beam width. However, it is difficult to minimize the power of interference sources in the beamforming output (output interference power) over a broad frequency range, since the cross-correlation between transfer functions from sound sources to microphones increases at some frequencies. With diffused sensing, the cross-correlation is minimized by physically varying the transfer functions. We investigated how a microphone array should be designed in order to minimize the cross-correlation between transfer functions and found that placing the array in a diffuse acoustic field produces optimum results. Because the transfer functions are known a priori, this finding makes it possible to narrow the directivity beam width over a broad frequency range. In practice, this can be achieved by placing microphones inside a reflective enclosure, part of which is open to let sound waves enter. We conducted experiments using 24 microphones and confirmed that diffused sensing reduced the output interference power over a broad frequency range and narrowed the beam width.
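The quantity this abstract centers on, the cross-correlation between transfer functions from two source directions, can be sketched for the conventional free-field case. This sketch assumes a uniform linear array with far-field plane-wave transfer functions; the array geometry, the function names, and all parameter values are hypothetical, and it does not model the reflective enclosure or the diffuse field studied in the paper.

```python
import cmath
import math


def steering_vector(num_mics, spacing_m, freq_hz, angle_deg, speed_of_sound=343.0):
    """Free-field, far-field transfer functions of a uniform linear array (assumed model)."""
    wavenumber = 2 * math.pi * freq_hz / speed_of_sound
    sin_angle = math.sin(math.radians(angle_deg))
    return [cmath.exp(-1j * wavenumber * m * spacing_m * sin_angle)
            for m in range(num_mics)]


def normalized_cross_correlation(a, b):
    """|a^H b| / (||a|| * ||b||): 1 means the two directions are indistinguishable."""
    inner = abs(sum(x.conjugate() * y for x, y in zip(a, b)))
    norm_a = math.sqrt(sum(abs(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(abs(y) ** 2 for y in b))
    return inner / (norm_a * norm_b)


# Hypothetical array: 8 microphones, 4 cm apart, sources at 0 and 30 degrees.
for freq_hz in (200.0, 1000.0, 4000.0):
    target = steering_vector(8, 0.04, freq_hz, 0.0)
    interferer = steering_vector(8, 0.04, freq_hz, 30.0)
    print(freq_hz, "Hz:", normalized_cross_correlation(target, interferer))
```

In this free-field model the correlation approaches 1 at low frequencies, where the phase differences across the array are small, so a filter has little to separate the two directions. Making the transfer functions less correlated is what the abstract attributes to diffused sensing.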
Citations: 13