
2013 IEEE Workshop on Automatic Speech Recognition and Understanding — Latest Publications

Discriminative piecewise linear transformation based on deep learning for noise robust automatic speech recognition
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707755
Yosuke Kashiwagi, D. Saito, N. Minematsu, K. Hirose
In this paper, we propose the use of deep neural networks to expand conventional methods of statistical feature enhancement based on piecewise linear transformation. Stereo-based piecewise linear compensation for environments (SPLICE), a powerful statistical approach to feature enhancement, models the probability distribution of input noisy features as a mixture of Gaussians. However, the soft assignment of an input vector to the divided regions is sometimes inadequate, and the vector then undergoes an unsuitable conversion; especially when the conversion is constrained to be linear, performance degrades easily. Feature enhancement using neural networks is another powerful approach, one that can directly model a non-linear relationship between the noisy and clean feature spaces. In this case, however, it tends to suffer from over-fitting. In this paper, we attempt to mitigate this problem by reducing the number of model parameters to estimate. Our neural network is trained so that its output layer is associated with states in the clean feature space, not the noisy feature space. This strategy makes the size of the output layer independent of the kind of noisy environment encountered. First, we characterize the distribution of clean features as a Gaussian mixture model; then, using deep neural networks, we discriminatively estimate the clean-space state to which an input noisy feature corresponds. Experimental evaluations on the Aurora 2 dataset demonstrate that our proposed method outperforms conventional methods.
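As a concrete (and heavily simplified) illustration of the idea above, the sketch below fits a GMM on clean features, trains a small network to predict the clean-space state from a noisy frame, and applies posterior-weighted per-state affine corrections. All data, sizes, and names are invented stand-ins; the paper's deep architecture and Aurora 2 setup are not reproduced.

```python
# Hypothetical sketch: classifier predicts posteriors over CLEAN-space GMM
# states from a noisy frame; enhancement is a posterior-weighted mixture of
# per-state affine maps fitted on stereo (noisy, clean) data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
D, N, K = 13, 2000, 8                      # feature dim, frames, GMM states

clean = rng.normal(size=(N, D))            # stand-ins for stereo training data
noisy = clean + 0.5 * rng.normal(size=(N, D))

# 1) Characterize the clean feature space with a GMM; its components define
#    the "states" the network will predict.
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
states = gmm.fit_predict(clean)            # assumes every state gets samples

# 2) Train a (shallow, for brevity) network mapping noisy frames to
#    clean-space state posteriors -- the discriminative part of the scheme.
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
net.fit(noisy, states)

# 3) Per-state affine corrections x -> A_k x + b_k, least-squares fitted on
#    the stereo frames assigned to each state.
A, b = [], []
for k in range(K):
    idx = states == k
    X = np.hstack([noisy[idx], np.ones((idx.sum(), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, clean[idx], rcond=None)
    A.append(W[:-1].T)
    b.append(W[-1])

def enhance(x):
    """Posterior-weighted piecewise linear enhancement of one noisy frame."""
    post = net.predict_proba(x[None])[0]
    return sum(post[k] * (A[k] @ x + b[k]) for k in range(K))

print(enhance(noisy[0]))
```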
Citations: 9
A hierarchical system for word discovery exploiting DTW-based initialization
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707761
Oliver Walter, Timo Korthals, Reinhold Häb-Umbach, B. Raj
Discovering the linguistic structure of a language solely from spoken input requires two steps: phonetic and lexical discovery. The first is concerned with identifying the categorical subword-unit inventory and relating it to the underlying acoustics, while the second aims at discovering words as repeated patterns of subword units. The hierarchical approach presented here accounts for classification errors in the first stage by modelling the pronunciation of a word in terms of subword units probabilistically: a hidden Markov model with discrete emission probabilities emits the observed subword-unit sequences. We describe how the system can be learned in a completely unsupervised fashion from spoken input. To improve the initialization of the training of the word pronunciations, the output of a dynamic time warping (DTW) based acoustic pattern discovery system is used, as it is able to discover similar temporal sequences in the input data. This improved initialization, using only weak supervision, has led to a 40% reduction in word error rate on a digit recognition task.
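The pronunciation model described above can be sketched as a left-to-right HMM with discrete emission probabilities over subword-unit labels, scored with the (scaled) forward algorithm. The transition and emission numbers below are invented, and the DTW initialization and unsupervised training are omitted.

```python
# Minimal discrete-emission HMM scoring a subword-unit sequence.
import numpy as np

U, S = 5, 3                                   # unit inventory size, HMM states
trans = np.array([[0.7, 0.3, 0.0],            # left-to-right transitions
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emis = np.full((S, U), 0.1)                   # discrete emission probabilities
emis[0, 1] = emis[1, 3] = emis[2, 0] = 0.6    # each state favours one unit
emis /= emis.sum(axis=1, keepdims=True)
init = np.array([1.0, 0.0, 0.0])

def forward_loglik(units):
    """Log-likelihood of an observed subword-unit sequence under the HMM."""
    alpha = init * emis[:, units[0]]
    log_norm = 0.0
    for u in units[1:]:
        alpha = (alpha @ trans) * emis[:, u]
        z = alpha.sum()                       # rescale to avoid underflow
        alpha /= z
        log_norm += np.log(z)
    return log_norm + np.log(alpha.sum())

print(forward_loglik([1, 1, 3, 3, 0]))        # plausible pronunciation
print(forward_loglik([4, 2, 4, 2, 4]))        # implausible one scores lower
```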
Citations: 38
Neighbour selection and adaptation for rapid speaker-dependent ASR
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707706
Udhyakumar Nallasamy, Mark C. Fuhs, M. Woszczyna, Florian Metze, Tanja Schultz
Speaker-dependent (SD) ASR systems have significantly lower word error rates (WER) than speaker-independent (SI) systems. However, SD systems require sufficient training data from the target speaker, which is impractical to collect in a short time. We present a technique for training SD models using just a few minutes of a speaker's data. We compensate for the lack of adequate speaker-specific data by selecting neighbours from a database of existing speakers who are acoustically close to the target speaker. These neighbours provide ample training data, which is used to adapt the SI model and obtain an initial SD model for the new speaker with significantly lower WER. We evaluate various neighbour selection algorithms on a large-scale medical transcription task and report a significant reduction in WER using only 5 minutes of speaker-specific data. We conduct a detailed analysis of factors such as gender and accent in neighbour selection. Finally, we study neighbour selection and adaptation in the context of discriminative objective functions.
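A minimal sketch of one plausible neighbour-selection criterion (not necessarily one of the algorithms evaluated in the paper): summarize each speaker by a diagonal Gaussian over acoustic frames and rank pool speakers by symmetric KL divergence to the target. All speakers and frames below are synthetic.

```python
# Illustrative neighbour selection by acoustic closeness.
import numpy as np

rng = np.random.default_rng(1)

def speaker_stats(frames):
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def sym_kl_diag(stats_a, stats_b):
    """Symmetric KL divergence between two diagonal Gaussians."""
    (ma, va), (mb, vb) = stats_a, stats_b
    kl_ab = 0.5 * np.sum(np.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    kl_ba = 0.5 * np.sum(np.log(va / vb) + (vb + (mb - ma) ** 2) / va - 1.0)
    return kl_ab + kl_ba

# Fake pool of 100 speakers; the target has only "a few minutes" of data.
pool = {f"spk{i:03d}": rng.normal(loc=rng.normal(size=13), size=(500, 13))
        for i in range(100)}
target = rng.normal(loc=0.2, size=(120, 13))

t_stats = speaker_stats(target)
ranked = sorted(pool, key=lambda s: sym_kl_diag(t_stats, speaker_stats(pool[s])))
neighbours = ranked[:10]          # pool these speakers' data for adaptation
print(neighbours)
```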
Citations: 3
Phonetic and anthropometric conditioning of MSA-KST cognitive impairment characterization system
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707734
A. Ivanov, S. Jalalvand, R. Gretter, D. Falavigna
We explore the impact of speech- and speaker-specific modeling on the Modulation Spectrum Analysis - Kolmogorov-Smirnov feature Testing (MSA-KST) characterization method in the task of automatically predicting a cognitive impairment diagnosis, namely dysphasia and pervasive developmental disorder. Phoneme-synchronous capturing of speech dynamics is a reasonable choice for a segmental speech characterization system, as it allows speech dynamics to be compared in similar phonetic contexts. Speaker-specific modeling aims at reducing the "within-the-class" variability of the characterized speech or speaker population by removing the effect of speaker properties that should have no relation to the characterization. Specifically, the vocal tract length of a speaker has nothing to do with the diagnosis, so the feature set should be normalized accordingly. The resulting system compares favorably to the baseline system of the Interspeech'2013 Computational Paralinguistics Challenge.
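The Kolmogorov-Smirnov ingredient of MSA-KST can be illustrated with a per-feature two-sample test; the sketch below (synthetic data, scipy's ks_2samp) keeps features whose distributions differ between groups. The modulation-spectrum analysis and vocal-tract-length normalization steps are not shown.

```python
# KS-based feature testing: keep features that separate the two groups.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
n_feat = 20
control = rng.normal(size=(200, n_feat))
impaired = rng.normal(size=(180, n_feat))
impaired[:, :5] += 0.8                    # only the first 5 features differ

selected = []
for j in range(n_feat):
    stat, p = ks_2samp(control[:, j], impaired[:, j])
    if p < 0.01:                          # distributions differ significantly
        selected.append(j)
print("features kept:", selected)         # expect roughly [0, 1, 2, 3, 4]
```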
Citations: 3
Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707732
Duc Le, E. Provost
Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.
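A common way to realize such an HMM/neural-network hybrid is to turn network state posteriors into scaled likelihoods, p(x|s) proportional to p(s|x)/p(s), and decode with Viterbi. The sketch below does exactly that with random stand-in posteriors; it is an assumed construction, not the paper's exact classifier suite.

```python
# Hybrid decoding: DBN-style posteriors reused as HMM emission scores.
import numpy as np

rng = np.random.default_rng(3)
T, S = 12, 4                                  # frames, emotion-related states
post = rng.dirichlet(np.ones(S), size=T)      # fake p(s | x_t) from a network
prior = post.mean(axis=0)                     # state priors from training data
log_emis = np.log(post) - np.log(prior)       # scaled log-likelihoods

log_trans = np.log(np.full((S, S), 0.1) + 0.6 * np.eye(S))  # rows sum to 1
log_init = np.log(np.full(S, 1.0 / S))

# Standard Viterbi over the scaled likelihoods.
delta = log_init + log_emis[0]
back = np.zeros((T, S), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + log_trans      # scores[i, j]: from state i to j
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + log_emis[t]

path = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
print(path[::-1])                             # most likely state sequence
```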
Citations: 102
An SVD-based scheme for MFCC compression in distributed speech recognition system
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707738
A. Touazi, M. Debyeche
This paper proposes a new scheme for low bit-rate source coding of Mel-frequency cepstral coefficients (MFCCs) in a distributed speech recognition (DSR) system. The method factorizes the compressed ETSI Advanced Front-End (ETSI-AFE) features into SVD components. Exploiting the correlation between successive MFCC frames, the odd frames are encoded using ETSI-AFE, while for the even frames only the singular values and the indices of the nearest left singular vectors are encoded and transmitted. At the server side, the non-transmitted MFCCs are reconstructed from their quantized singular values and the nearest left singular vectors. The system provides a compression bit-rate of 2.7 kbps. Recognition experiments were carried out on the Aurora-2 database in clean and multi-condition training modes. The simulation results show good recognition performance, without significant degradation with respect to the ETSI-AFE encoder.
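Very loosely, the factorization-plus-codebook idea can be sketched as below: SVD-decompose a block of consecutive MFCC frames, quantize the singular values, and replace each left singular vector by its nearest codebook entry. In the actual scheme the right factor is recovered via the fully coded odd frames rather than transmitted; the codebook, block size, and quantizer here are all invented.

```python
# Loose sketch of SVD factorization with a singular-vector codebook.
import numpy as np

rng = np.random.default_rng(4)
block = rng.normal(size=(8, 13))               # 8 consecutive MFCC frames
U, s, Vt = np.linalg.svd(block, full_matrices=False)

s_q = np.round(s, 1)                           # coarse stand-in quantizer

# Replace each left singular vector by its nearest (cosine) codebook entry.
codebook = rng.normal(size=(64, 8))            # hypothetical trained codebook
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
idx = [int(np.argmax(codebook @ u)) for u in U.T]   # transmit these indices
U_hat = np.stack([codebook[i] for i in idx], axis=1)

# Server side: rebuild the block (Vt assumed recoverable from the fully
# coded odd frames in the real scheme; reused directly here for brevity).
block_hat = U_hat @ np.diag(s_q) @ Vt
print("reconstruction error:", np.linalg.norm(block - block_hat))
```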
Citations: 1
A generalized discriminative training framework for system combination
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707703
Yuuki Tachioka, Shinji Watanabe, Jonathan Le Roux, J. Hershey
This paper proposes a generalized discriminative training framework for system combination which encompasses acoustic modeling (Gaussian mixture models and deep neural networks) and discriminative feature transformation. To improve performance by combining base systems with complementary systems, a complementary system should itself perform reasonably well while tending to produce outputs different from those of the base system. Although it is difficult to balance these two somewhat opposing goals with conventional heuristic combination approaches, our framework provides a new objective function that allows the balance to be adjusted within a sequential discriminative training criterion. We also describe how the proposed method relates to boosting. Experiments on a highly noisy medium-vocabulary speech recognition task (2nd CHiME challenge, track 2) and an LVCSR task (Corpus of Spontaneous Japanese) show the effectiveness of the proposed method compared with a conventional system combination approach.
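One toy way to picture an objective that trades accuracy against complementarity (emphatically not the paper's exact criterion, which is a sequential discriminative one) is a cross-entropy term minus a weighted divergence from a frozen base system:

```python
# Toy accuracy-vs-diversity objective for a complementary system.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(logits_comp, logits_base, labels, lam=0.3):
    """Cross-entropy on the truth minus lam * KL from the base system.

    Minimizing this keeps the complementary system accurate while pushing
    its output distribution away from the (frozen) base system's.
    """
    p = softmax(logits_comp)
    q = softmax(logits_base)
    n = len(labels)
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1).mean()
    return ce - lam * kl              # reward disagreement with the base

rng = np.random.default_rng(5)
lc, lb = rng.normal(size=(16, 10)), rng.normal(size=(16, 10))
y = rng.integers(0, 10, size=16)
print(combined_loss(lc, lb, y))
```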
Citations: 7
Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management
Pub Date : 2013-12-01 DOI: 10.1109/ASRU.2013.6707714
Emmanuel Ferreira, F. Lefèvre
This paper investigates the conditions under which expert knowledge can be used to accelerate the policy optimization of a learning agent. Recent work on reinforcement learning for dialogue management has produced sophisticated methods for value estimation that jointly address the exploration/exploitation dilemma, sample efficiency, and non-stationary environments. In this paper, a reward shaping method and an exploration scheme, both based on intuitive hand-coded expert advice, are combined with an efficient temporal-difference-based learning procedure. The key objective is to boost the initial training stage, when the system is not yet reliable enough to interact with real users (e.g. clients). Our claims are illustrated by simulation experiments carried out using a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS).
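A standard instantiation of reward shaping is potential-based: the agent learns from r + gamma*phi(s') - phi(s), where the potential phi encodes the expert's intuition about good states without changing the optimal policy. The tabular Q-learning toy below (a chain as a stand-in for a dialogue state space) illustrates the mechanism; the paper's HIS framework and its specific exploration scheme are not reproduced.

```python
# Potential-based reward shaping in tabular Q-learning on a toy chain.
import numpy as np

rng = np.random.default_rng(6)
n_states, n_actions, gamma, alpha, eps = 6, 2, 0.95, 0.1, 0.2
phi = np.arange(n_states, dtype=float)        # expert potential: progress made

Q = np.zeros((n_states, n_actions))
for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        shaped = r + gamma * phi[s2] - phi[s]           # shaped reward
        done = s2 == n_states - 1
        target = shaped + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
print(Q.argmax(axis=1))                       # learned policy: mostly "advance"
```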
Citations: 16
Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling
Pub Date : 2013-09-05 DOI: 10.1109/ASRU.2013.6707747
Tara N. Sainath, L. Horesh, Brian Kingsbury, A. Aravkin, B. Ramabhadran
Hessian-free training has become a popular parallel second-order optimization technique for deep neural network training. This study aims at speeding up Hessian-free training both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS-based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the use of flexible Krylov subspace solvers that retain the theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hour English Broadcast News task, these methodologies provide roughly a 1.5× speed-up, whereas on a 300-hour Switchboard task they provide over a 2.3× speed-up, with no loss in WER. These results suggest that even further speed-ups can be expected as problem scale and complexity grow.
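The preconditioning mechanics can be sketched on a toy quadratic: curvature pairs (s, y = As) feed an L-BFGS two-loop recursion, which is then applied as the preconditioner inside conjugate gradient. This illustrates only the idea, not the paper's recipe or its flexible solvers.

```python
# L-BFGS two-loop recursion as a CG preconditioner on a toy SPD system.
import numpy as np

rng = np.random.default_rng(7)
n = 50
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # SPD "Hessian"
g = rng.normal(size=n)

# Curvature pairs, as HF training would gather via Hessian-vector products.
pairs = [(s, A @ s) for s in rng.normal(size=(5, n))]

def lbfgs_apply(v):
    """Two-loop recursion: approximately multiply v by A^{-1}."""
    alphas, q = [], v.copy()
    for s, y in reversed(pairs):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    s, y = pairs[-1]
    q = q * (s @ y) / (y @ y)                 # initial scaling H0
    for (s, y), a in zip(pairs, reversed(alphas)):
        b = (y @ q) / (y @ s)
        q = q + (a - b) * s
    return q

def pcg(b, tol=1e-8, maxit=200):
    """Preconditioned conjugate gradient for A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = lbfgs_apply(r)
    p = z.copy()
    for it in range(maxit):
        Ap = A @ p
        a = (r @ z) / (p @ Ap)
        x += a * p
        r_new = r - a * Ap
        if np.linalg.norm(r_new) < tol:
            return x, it
        z_new = lbfgs_apply(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x, maxit

x, iters = pcg(g)
print("PCG iterations:", iters, "residual:", np.linalg.norm(A @ x - g))
```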
Citations: 21
Improvements to Deep Convolutional Neural Networks for LVCSR
Pub Date : 2013-09-05 DOI: 10.1109/ASRU.2013.6707749
Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, B. Ramabhadran
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are better able to reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing relative improvements in word error rate (WER) of 4-12% over DNNs across a variety of LVCSR tasks. In this paper, we describe several methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy for using dropout during Hessian-free sequence training. We find that with these improvements, particularly fMLLR and dropout, we achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
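For orientation, a minimal PyTorch sketch of the ingredients discussed: convolution over log-mel feature maps (which in the paper's setup would carry fMLLR-transformed features), frequency-only pooling, and dropout before the fully connected layers. The architecture and all layer sizes are assumed for illustration, not the paper's.

```python
# Assumed-architecture CNN acoustic model with pooling and dropout.
import torch
import torch.nn as nn

class SmallCNNAcousticModel(nn.Module):
    def __init__(self, n_states=512, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(9, 9)),   # over (freq, time)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),       # pool along frequency only
            nn.Conv2d(64, 128, kernel_size=(4, 3)),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),                    # dropout before FC layers
            nn.LazyLinear(1024),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(1024, n_states),              # CD-HMM state scores
        )

    def forward(self, x):                           # x: (batch, 1, mel, frames)
        return self.classifier(self.features(x))

model = SmallCNNAcousticModel()
logmel = torch.randn(4, 1, 40, 11)                  # 40 mel bins, 11-frame context
print(model(logmel).shape)                          # -> torch.Size([4, 512])
```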
Citations: 218