
Latest publications from the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding

Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707732
Duc Le, E. Provost
Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.
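As background on the hybrid setup, the sketch below shows the generic decoding step of such systems: an HMM whose emission scores are neural-network posteriors scaled by state priors, decoded with Viterbi. The network itself, the FAU Aibo features, and the paper's exact topology are omitted; all shapes and values are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def viterbi_hybrid(nn_posteriors, state_priors, log_trans, log_init):
    """Viterbi decoding of an HMM whose emissions are scaled NN posteriors:
    log p(x_t | s) is replaced by log p(s | x_t) - log p(s)."""
    log_obs = np.log(nn_posteriors + 1e-12) - np.log(state_priors + 1e-12)
    T, S = log_obs.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    state = int(delta[-1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return path[::-1]

# Toy run: 4 emotion states, 20 frames of made-up network posteriors.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(4), size=20)
priors = np.full(4, 0.25)
log_trans = np.log(np.full((4, 4), 0.1) + 0.6 * np.eye(4))   # sticky transitions
log_init = np.log(np.full(4, 0.25))
print(viterbi_hybrid(posteriors, priors, log_trans, log_init))
```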
Citations: 102
An SVD-based scheme for MFCC compression in distributed speech recognition system
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707738
A. Touazi, M. Debyeche
This paper proposes a new scheme for low bit-rate source coding of Mel Frequency Cepstral Coefficients (MFCCs) in a Distributed Speech Recognition (DSR) system. The method uses the compressed ETSI Advanced Front-End (ETSI-AFE) features factorized into SVD components. Exploiting the correlation between successive MFCC frames, the odd frames are encoded using ETSI-AFE, while for the even frames only the singular values and the index of the nearest left singular vector are encoded and transmitted. On the server side, the non-transmitted MFCCs are estimated from their quantized singular values and the nearest left singular vectors. The system provides a compression bit-rate of 2.7 kbps. Recognition experiments were carried out on the Aurora-2 database for the clean and multi-condition training modes. The simulation results show good recognition performance, with no significant degradation relative to the ETSI-AFE encoder.
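As a rough illustration of the SVD components the scheme is built on (not the paper's bitstream, which additionally relies on the ETSI-AFE codec and on inter-frame correlation), a block of MFCC frames can be factorized and approximated from its leading singular components:

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc_block = rng.standard_normal((8, 13))   # 8 consecutive frames x 13 cepstral coefficients (toy data)

# Factorize the frame block into its SVD components.
U, s, Vt = np.linalg.svd(mfcc_block, full_matrices=False)

# Keeping only the r strongest components mimics the bit-rate reduction:
# the decoder rebuilds an approximation from a few singular values and vectors.
r = 4
approx = (U[:, :r] * s[:r]) @ Vt[:r, :]
print("relative reconstruction error:",
      np.linalg.norm(mfcc_block - approx) / np.linalg.norm(mfcc_block))
```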
Citations: 1
A generalized discriminative training framework for system combination
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707703
Yuuki Tachioka, Shinji Watanabe, Jonathan Le Roux, J. Hershey
This paper proposes a generalized discriminative training framework for system combination, which encompasses acoustic modeling (Gaussian mixture models and deep neural networks) and discriminative feature transformation. For combination with a base system to improve performance, a complementary system should have reasonably good accuracy while tending to produce outputs that differ from those of the base system. Although it is difficult to balance these two somewhat opposing targets in conventional heuristic combination approaches, our framework provides a new objective function that allows the balance to be adjusted within a sequential discriminative training criterion. We also describe how the proposed method relates to boosting. Experiments on a highly noisy medium-vocabulary speech recognition task (2nd CHiME challenge, track 2) and an LVCSR task (Corpus of Spontaneous Japanese) show the effectiveness of the proposed method compared with a conventional system combination approach.
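The paper's objective function is not given in the abstract; for context, here is a minimal sketch of plain frame-level posterior combination of a base and a complementary system, the kind of combination whose benefit the proposed criterion trades off against complementary-system accuracy. The weight and shapes are assumptions.

```python
import numpy as np

def combine_posteriors(base, complementary, weight=0.5):
    """Log-linear frame-level combination of two systems' state posteriors."""
    log_mix = weight * np.log(base + 1e-12) + (1.0 - weight) * np.log(complementary + 1e-12)
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(5), size=10)             # 10 frames, 5 HMM states
complementary = rng.dirichlet(np.ones(5), size=10)
print(combine_posteriors(base, complementary).shape)   # (10, 5)
```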
Citations: 7
Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707714
Emmanuel Ferreira, F. Lefèvre
This paper investigates the conditions under which expert knowledge can be used to accelerate the policy optimization of a learning agent. Recent work on reinforcement learning for dialogue management has produced sophisticated value-estimation methods that jointly address the exploration/exploitation dilemma, sample efficiency, and non-stationary environments. In this paper, a reward shaping method and an exploration scheme, both based on intuitive hand-coded expert advice, are combined with an efficient temporal-difference learning procedure. The key objective is to boost the initial training stage, when the system is not yet reliable enough to interact with real users (e.g. clients). Our claims are illustrated by simulation experiments carried out with a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS) model.
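The hand-coded advice and the HIS-based learner are not detailed in the abstract; the sketch below only illustrates potential-based reward shaping inside a single tabular temporal-difference update, with a hypothetical dialogue-progress potential standing in for the expert knowledge.

```python
import numpy as np

def expert_potential(state):
    # Hypothetical hand-coded advice: value dialogue states by how many
    # slots the system believes are already confirmed.
    return float(state)

def shaped_reward(r, s, s_next, gamma=0.99):
    # Potential-based shaping: leaves the optimal policy unchanged while
    # injecting expert knowledge into the early updates.
    return r + gamma * expert_potential(s_next) - expert_potential(s)

# One tabular temporal-difference (Q-learning) update with the shaped reward.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
s, a, r, s_next = 0, 1, -1.0, 1          # toy transition: one more slot confirmed
alpha, gamma = 0.1, 0.99
target = shaped_reward(r, s, s_next, gamma) + gamma * Q[s_next].max()
Q[s, a] += alpha * (target - Q[s, a])
print(Q[s, a])
```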
Citations: 16
Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling
Pub Date : 2013-09-05 DOI: 10.1109/ASRU.2013.6707747
Tara N. Sainath, L. Horesh, Brian Kingsbury, A. Aravkin, B. Ramabhadran
Hessian-free training has become a popular parallel second order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5× speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3× speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.
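Of the two ideas, the sampling schedule is the simplest to illustrate: the amount of data used for gradient and Krylov-iteration computations grows geometrically over training. The starting fraction and growth factor below are assumptions, not the paper's values.

```python
def geometric_sample_sizes(total_frames, start_frac=0.01, growth=2.0):
    """Geometrically growing subset sizes for gradient / Krylov-subspace work."""
    size = max(1, int(start_frac * total_frames))
    schedule = []
    while size < total_frames:
        schedule.append(size)
        size = int(size * growth)
    return schedule + [total_frames]

print(geometric_sample_sizes(1_000_000))
```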
Citations: 21
Improvements to Deep Convolutional Neural Networks for LVCSR
Pub Date : 2013-09-05 DOI: 10.1109/ASRU.2013.6707749
Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, B. Ramabhadran
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
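Among the four ingredients, dropout is the easiest to show in isolation. A minimal inverted-dropout sketch follows; the paper's actual contribution, using dropout inside Hessian-free sequence training, is not reproduced here.

```python
import numpy as np

def dropout(activations, rate, rng, train=True):
    """Inverted dropout: zero units with probability `rate` during training and
    rescale the survivors, so the forward pass is unchanged at test time."""
    if not train or rate == 0.0:
        return activations
    mask = (rng.random(activations.shape) >= rate).astype(activations.dtype)
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 6))   # hypothetical hidden-layer activations
print(dropout(hidden, rate=0.5, rng=rng).round(2))
```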
Citations: 218
Investigation of multilingual deep neural networks for spoken term detection
Pub Date : 2013-09-03 DOI: 10.1109/ASRU.2013.6707719
K. Knill, M. Gales, S. Rath, P. Woodland, Chao Zhang, Shi-Xiong Zhang
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (~10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language independent acoustic model test on the target language showed that retraining or adapting of the acoustic models to the target language is currently minimally needed to achieve reasonable performance.
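In a Tandem configuration, bottleneck activations from the multilingual network are appended to the base acoustic features before GMM-HMM training. A toy sketch of that feature path follows; the network weights are random stand-ins and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a small multilingual DNN with a narrow bottleneck layer.
W1, b1 = 0.1 * rng.standard_normal((13, 64)), np.zeros(64)
W_bn, b_bn = 0.1 * rng.standard_normal((64, 8)), np.zeros(8)   # 8-dim bottleneck

def bottleneck_features(frames):
    h = np.tanh(frames @ W1 + b1)
    return np.tanh(h @ W_bn + b_bn)

frames = rng.standard_normal((100, 13))                      # e.g. MFCC frames
tandem = np.hstack([frames, bottleneck_features(frames)])    # input to the GMM-HMM
print(tandem.shape)                                          # (100, 21)
```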
Citations: 92
Modified splice and its extension to non-stereo data for noise robust speech recognition
Pub Date : 2013-07-15 DOI: 10.1109/ASRU.2013.6707725
D. S. P. Kumar, N. Prasad, Vikas Joshi, S. Umesh
In this paper, a modification to the training process of the popular SPLICE algorithm is proposed for noise-robust speech recognition. The modification is based on feature correlations and enables this stereo-based algorithm to improve performance in all noise conditions, especially unseen ones. Further, the modified framework is extended to work for non-stereo datasets, where clean and noisy training utterances are available but not stereo counterparts. Finally, an MLLR-based, computationally efficient run-time noise adaptation method in the SPLICE framework is proposed. The modified SPLICE shows 8.6% absolute improvement over SPLICE in Test C of the Aurora-2 database, and 2.93% overall. The non-stereo method shows 10.37% and 6.93% absolute improvements over the Aurora-2 and Aurora-4 baseline models respectively. Run-time adaptation shows 9.89% absolute improvement in the modified framework as compared to SPLICE for Test C, and 4.96% overall w.r.t. standard MLLR adaptation on HMMs.
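For context, baseline SPLICE estimates the clean features as the noisy features plus a posterior-weighted sum of per-component bias vectors learned from stereo data. A minimal sketch of that correction step follows, with toy parameters; the paper's correlation-based modification and non-stereo training are not shown.

```python
import numpy as np

def splice_correct(noisy, means, vars_diag, weights, biases):
    """Baseline SPLICE: x_hat = y + sum_k p(k | y) * r_k, where p(k | y) comes from
    a GMM over noisy features and r_k are bias vectors learned from stereo data."""
    diff = noisy[None, :] - means                                     # K x D
    log_gauss = -0.5 * np.sum(diff ** 2 / vars_diag
                              + np.log(2.0 * np.pi * vars_diag), axis=1)
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    return noisy + post @ biases

# Toy parameters: K = 2 mixture components in D = 3 feature dimensions.
rng = np.random.default_rng(1)
K, D = 2, 3
means = rng.standard_normal((K, D))
vars_diag = np.ones((K, D))
weights = np.array([0.5, 0.5])
biases = 0.1 * rng.standard_normal((K, D))
print(splice_correct(rng.standard_normal(D), means, vars_diag, weights, biases))
```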
Citations: 1
NMF-based keyword learning from scarce data
DOI: 10.1109/ASRU.2013.6707762
B. Ons, J. Gemmeke, H. V. hamme
This research is situated in a project aimed at the development of a vocal user interface (VUI) that learns to understand its users, specifically persons with a speech impairment. The vocal interface adapts to the speech of the user by learning the vocabulary from interaction examples. Word learning is implemented through weakly supervised non-negative matrix factorization (NMF). The goal of this study is to investigate how word learning can be improved when the number of interaction examples is low. We demonstrate two approaches to training NMF models on scarce data: 1) training word models using smoothed training data, and 2) training word models that strictly correspond to the grounding information derived from a few interaction examples. We found that both approaches can substantially improve word learning from scarce training data.
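A minimal sketch of the underlying machinery follows: multiplicative-update NMF on a matrix whose columns are utterances, with hypothetical keyword-indicator rows stacked on the acoustic representation to supply the weak supervision. The smoothing and grounding schemes of the paper are not reproduced.

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF for the Frobenius objective: V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Weak supervision: stack keyword-indicator rows (grounding) on top of the
# acoustic representation of each utterance, so columns of W couple keywords
# to their acoustic patterns.
rng = np.random.default_rng(2)
acoustic = rng.random((50, 30))                     # 50 acoustic features x 30 utterances
labels = rng.integers(0, 2, (5, 30)).astype(float)  # 5 hypothetical keywords
V = np.vstack([labels, acoustic])
W, H = nmf(V, rank=8)
print(np.linalg.norm(V - W @ H))
```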
Citations: 3
A study of supervised intrinsic spectral analysis for TIMIT phone classification
DOI: 10.1109/ASRU.2013.6707739
Reza Sahraeian, Dirk Van Compernolle
Intrinsic Spectral Analysis (ISA) has been formulated within a manifold learning setting, allowing natural extensions to out-of-sample data together with feature reduction in a learning framework. In this paper, we propose two approaches to improve the performance of supervised ISA, and we examine the effect of applying a linear discriminant technique in the intrinsic subspace compared with the extrinsic one. In the interest of reducing complexity, we propose a preprocessing operation to find a small subset of data points that is well representative of the manifold structure; this is accomplished by maximizing the quadratic Renyi entropy. Furthermore, we use class-based graphs, which not only simplify the problem but can also be helpful in a classification task. Experimental results for the phone classification task on the TIMIT dataset show that ISA features improve performance compared with traditional features, and that supervised discriminant techniques perform better in the ISA subspace than in conventional feature spaces.
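The subset-selection step can be illustrated on its own: with a Parzen window estimate, the quadratic Renyi entropy of a candidate subset reduces (up to constants) to the mean of a kernel matrix, and points can be added greedily to maximize it. The kernel width and the greedy strategy below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def quadratic_renyi_entropy(X, sigma=1.0):
    """Parzen-window estimate (up to additive constants):
    H2 = -log (1/N^2) sum_ij exp(-||x_i - x_j||^2 / (4 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4.0 * sigma ** 2))
    return -np.log(K.mean())

# Greedy selection: repeatedly add the point that most increases the entropy,
# so the kept subset spreads out over the data manifold.
rng = np.random.default_rng(3)
data = rng.standard_normal((200, 5))
subset = [0]
for _ in range(19):
    rest = [i for i in range(len(data)) if i not in subset]
    gains = [quadratic_renyi_entropy(data[subset + [i]]) for i in rest]
    subset.append(rest[int(np.argmax(gains))])
print(sorted(subset))
```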
Citations: 6