
2012 IEEE Spoken Language Technology Workshop (SLT): latest publications

On the generalization of Shannon entropy for speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424204
Nicolas Obin, M. Liuni
This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distribution of amplitudes). The proposed representation outperforms all other noisiness measures - including Shannon and Wiener entropies - in a large-scale classification of vocal effort (whispered-soft/normal/loud-shouted) in the real scenario of multi-language massive role-playing video games. The improvement is around 10% in relative error reduction, and is particularly significant for the recognition of noisy speech - i.e., whispery/breathy speech. This confirms the role of noisiness for speech recognition, and will further be extended to the classification of voice quality for the design of an automatic voice casting system in video games.
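To make the measure concrete, here is a minimal sketch (not the authors' code) of the Rényi entropy of a spectral frame, assuming the magnitude spectrum is normalized to a probability distribution; the Shannon entropy is recovered in the limit as alpha approaches 1.

```python
import numpy as np

def renyi_entropy(p, alpha, eps=1e-12):
    """Rényi entropy of order alpha for a distribution p.

    alpha -> 1 recovers the Shannon entropy; small alpha weights the
    flat (noise-like) part of the spectrum, large alpha the prominent
    (harmonic) peaks -- the flexibility the abstract describes.
    """
    p = np.asarray(p, dtype=float)
    p = p / (p.sum() + eps)                 # normalize to a distribution
    if np.isclose(alpha, 1.0):              # Shannon limit
        return float(-np.sum(p * np.log(p + eps)))
    return float(np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha))

# Toy per-frame noisiness feature from a power spectrum (hypothetical):
frame = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
print(renyi_entropy(frame, alpha=0.5), renyi_entropy(frame, alpha=3.0))
```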
Citations: 17
Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424267
Na Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, Melissa L. Sturge‐Apple
Emotion classification is essential for understanding human interactions and hence is a vital component of behavioral studies. Although numerous algorithms have been developed, the emotion classification accuracy is still short of what is desired for the algorithms to be used in real systems. In this paper, we evaluate an approach where basic acoustic features are extracted from speech samples, and the One-Against-All (OAA) Support Vector Machine (SVM) learning algorithm is used. We use a novel hybrid kernel, where we choose the optimal kernel functions for the individual OAA classifiers. Outputs from the OAA classifiers are normalized and combined using a thresholding fusion mechanism to finally classify the emotion. Samples with low `relative confidence' are left as `unclassified' to further improve the classification accuracy. Results show that the decision-level recall of our approach for six-class emotion classification is 80.5%, outperforming a state-of-the-art approach that uses the same dataset.
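As a rough sketch of this pipeline under stated assumptions (scikit-learn SVC, min-max score normalization, and an invented margin threshold; the paper's exact kernels and fusion rule are not given here), a one-against-all setup with rejection might look like:

```python
import numpy as np
from sklearn.svm import SVC

def train_oaa(X, y, kernels):
    """One-Against-All SVMs, one binary classifier per emotion class,
    each with its own (hypothetical) kernel choice."""
    return [SVC(kernel=k).fit(X, (y == c).astype(int))
            for c, k in enumerate(kernels)]

def classify_with_rejection(models, X, threshold=0.2):
    scores = np.column_stack([m.decision_function(X) for m in models])
    # Min-max normalize each classifier's scores to a comparable range.
    scores = (scores - scores.min(0)) / (scores.max(0) - scores.min(0) + 1e-12)
    top2 = np.sort(scores, axis=1)[:, -2:]
    labels = scores.argmax(axis=1)
    # 'Relative confidence' = margin between the two best scores;
    # low-margin samples stay unclassified (marked -1).
    labels[(top2[:, 1] - top2[:, 0]) < threshold] = -1
    return labels

# Toy usage with synthetic data for six classes:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 10)), rng.integers(0, 6, size=120)
models = train_oaa(X, y, ["rbf", "linear", "rbf", "poly", "rbf", "sigmoid"])
print(classify_with_rejection(models, X[:5]))
```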
Citations: 54
Policy optimisation of POMDP-based dialogue systems without state space compression
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424165
Milica Gasic, Matthew Henderson, Blaise Thomson, P. Tsiakoulis, S. Young
The partially observable Markov decision process (POMDP) has been proposed as a dialogue model that enables automatic improvement of the dialogue policy and robustness to speech understanding errors. It requires, however, a large number of dialogues to train the dialogue policy. Gaussian processes (GP) have recently been applied to POMDP dialogue management optimisation showing an ability to substantially increase the speed of learning. Here, we investigate this further using the Bayesian Update of Dialogue State dialogue manager. We show that it is possible to apply Gaussian processes directly to the belief state, removing the need for a parametric policy representation. In addition, the resulting policy learns significantly faster while maintaining operational performance.
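A toy illustration of the core idea, placing a Gaussian process directly over belief states so that no parametric policy or state-space compression is needed; this is a generic GP regression sketch, not the GP-SARSA machinery the paper builds on:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
beliefs = rng.dirichlet(np.ones(5), size=50)   # 50 visited belief states
returns = rng.normal(size=50)                  # observed returns (toy values)

# One GP per action would map beliefs to Q-values; shown for one action.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(beliefs, returns)
q_mean, q_std = gp.predict(rng.dirichlet(np.ones(5), size=3), return_std=True)
print(q_mean, q_std)  # the predictive uncertainty can drive exploration
```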
Citations: 21
Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424220
Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, Lin-Shan Lee
Voice access to cloud applications via smartphones is very attractive today, particularly because a smartphone is used by a single user, so personalized acoustic/language models become feasible. In particular, huge quantities of text with known authors and given relationships are available on social networks over the Internet, so it is possible to train personalized language models: it is reasonable to assume that users connected by such relationships share some common subject topics, wording habits, and linguistic patterns. In this paper, we propose an adaptation framework for building a robust personalized language model that incorporates the texts the target user and other users have posted on social networks, in order to handle the linguistic mismatch across different users. Experiments on a Facebook dataset showed encouraging improvements in both model perplexity and recognition accuracy for the proposed approaches, which consider relationships among users, similarity based on latent topics, and random walks over a user graph.
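The adaptation idea can be sketched as simple model interpolation: a background model is mixed with a user model estimated from the user's (and related users') posts. The unigram simplification and the weight lam below are illustrative assumptions:

```python
from collections import Counter

def unigram_probs(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(background, personal, lam=0.3):
    """P(w) = lam * P_personal(w) + (1 - lam) * P_background(w)."""
    vocab = set(background) | set(personal)
    return {w: lam * personal.get(w, 0.0) + (1 - lam) * background.get(w, 0.0)
            for w in vocab}

bg = unigram_probs("call mom tonight please call a taxi".split())
user = unigram_probs("ping the dev server then call the standup".split())
lm = interpolate(bg, user, lam=0.3)
print(sorted(lm.items(), key=lambda kv: -kv[1])[:5])
```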
Citations: 11
Lexical entrainment and success in student engineering groups
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424258
Heather Friedberg, D. Litman, Susannah B. F. Paletz
Lexical entrainment is a measure of how the words that speakers use in a conversation become more similar over time. In this paper, we propose a measure of lexical entrainment for multi-party speaking situations. We apply this score to a corpus of student engineering groups using high-frequency words and project words, and investigate the relationship between lexical entrainment and group success on a class project. Our initial findings show that, using the entrainment score with project-related words, there is a significant difference between the lexical entrainment of high performing groups, which tended to increase with time, and the entrainment for low performing groups, which tended to decrease with time.
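One plausible way to operationalize such a score (an illustrative guess, not the authors' exact formula) is to track how the similarity between a speaker's and the group's usage of project words changes from the first to the second half of a session:

```python
from collections import Counter

def freq_vector(tokens, vocab):
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def entrainment_trend(speaker_tokens, group_tokens, vocab):
    """Similarity of a speaker's project-word usage to the group's,
    early vs. late in the session; a rising value suggests entrainment."""
    mid_s, mid_g = len(speaker_tokens) // 2, len(group_tokens) // 2
    early = cosine(freq_vector(speaker_tokens[:mid_s], vocab),
                   freq_vector(group_tokens[:mid_g], vocab))
    late = cosine(freq_vector(speaker_tokens[mid_s:], vocab),
                  freq_vector(group_tokens[mid_g:], vocab))
    return late - early

vocab = ["gear", "torque", "shaft", "bearing"]  # hypothetical project words
print(entrainment_trend("we need a gear and a shaft".split(),
                        "the gear meshes the bearing gear".split(), vocab))
```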
Citations: 57
Automatic transcription of academic lectures from diverse disciplines
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424257
Ghada Alharbi, Thomas Hain
In a multimedia world it is now common to record professional presentations, on video or with audio only. Such recordings include talks and academic lectures, which are becoming a valuable resource for students and professionals alike. However, organising such material from a diverse set of disciplines is no easy task. One way to address this problem is to build an Automatic Speech Recognition (ASR) system and use its output to analyse such materials. In this work, ASR results for lectures from diverse sources are presented. The work is based on a new collection of data obtained by the Liberated Learning Consortium (LLC). The study's primary goals are two-fold: first, to show variability across disciplines from an ASR perspective and how to choose sources for the construction of language models (LMs); second, to provide an analysis of the lecture transcriptions for automatic determination of structure in lecture discourse. In particular, we investigate whether there are properties common to lectures from different disciplines. This study focuses on textual features. Lectures are multimodal experiences - it is not clear whether textual features alone are sufficient for the recognition of such common elements, or whether other features, e.g. acoustic features such as the speaking rate, are needed. The results show that such common properties are retained across disciplines even on ASR output with a Word Error Rate (WER) of 30%.
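Since the headline result is stated at a fixed Word Error Rate, a standard WER computation by word-level edit distance is sketched below (generic code, not the paper's scoring tool):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / N,
    computed with the usual Levenshtein dynamic program over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(r), 1)

print(wer("the lecture covers signal processing",
          "a lecture covers processing"))  # 2 edits / 5 words -> 0.4
```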
Citations: 1
Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424211
Cong-Thanh Do, M. Taghizadeh, Philip N. Garner
This paper investigates the combination of cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition. Test speech signals are recorded by a circular microphone array and subsequently processed with superdirective beamforming and McCowan post-filtering. Training speech signals, from the multichannel overlapping Number corpus (MONC), are clean and do not overlap. Cochlear implant-like speech processing, inspired by the speech processing strategy of cochlear implants, is applied to the training and test speech signals. Cepstral normalization, including cepstral mean and variance normalization (CMN and CVN), is applied to the training and test cepstra. Experiments show that applying either cepstral normalization or cochlear implant-like speech processing helps reduce the WERs of microphone array-based speech recognition, and combining the two reduces the WERs further when there is overlapping speech. Train/test mismatch is measured using the Kullback-Leibler divergence (KLD) between the global probability density functions (PDFs) of training and test cepstral vectors. This measure reveals a train/test mismatch reduction when either cepstral normalization or cochlear implant-like speech processing is used, and it also shows that combining the two processing steps further reduces both the train/test mismatch and the WERs.
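The normalization step is straightforward to sketch; below is per-utterance CMN+CVN, plus the closed-form KL divergence between diagonal Gaussians as one simple stand-in for the train/test mismatch measure (the diagonal-Gaussian assumption is mine, for illustration):

```python
import numpy as np

def cmvn(cepstra):
    """Per-utterance cepstral mean (CMN) and variance (CVN) normalization.
    cepstra: (frames, coeffs) array, e.g. MFCCs for one channel."""
    return (cepstra - cepstra.mean(axis=0)) / (cepstra.std(axis=0) + 1e-12)

def gaussian_kld(mu0, var0, mu1, var1):
    """KL(N0 || N1) for diagonal Gaussians fitted to train/test cepstra."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

train = np.random.randn(1000, 13) * 2.0 + 1.0   # synthetic 'training' cepstra
test = np.random.randn(800, 13)                 # mismatched 'test' cepstra
print(gaussian_kld(train.mean(0), train.var(0), test.mean(0), test.var(0)))
print(gaussian_kld(cmvn(train).mean(0), cmvn(train).var(0),
                   cmvn(test).mean(0), cmvn(test).var(0)))  # ~0 after CMVN
```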
Citations: 7
A noise-robust speech recognition method composed of weak noise suppression and weak Vector Taylor Series Adaptation
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424205
Shuji Komeiji, T. Arakawa, Takafumi Koshinaka
This paper proposes a noise-robust speech recognition method composed of weak noise suppression (NS) and weak Vector Taylor Series Adaptation (VTSA). The proposed method compensates for the defects of NS and VTSA while retaining only their advantages. The weak NS reduces the over-suppression distortion that can accompany noise-suppressed speech. The weak VTSA avoids over-adaptation by offsetting the part of the acoustic-model adaptation that corresponds to the suppressed noise. Evaluation results on the AURORA2 database show that the proposed method achieves word accuracy up to 1.2 points higher (87.4%) than a method with VTSA alone (86.2%), which is itself always better than its counterpart with NS.
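The paper's NS front end is not detailed in the abstract, but "weak" suppression can be illustrated as partial spectral subtraction: only a fraction of the noise estimate is removed and the result is floored, trading residual noise for less over-suppression distortion (the factor values are assumptions):

```python
import numpy as np

def weak_spectral_subtraction(power_spec, noise_est, alpha=0.5, floor=0.1):
    """Subtract only a fraction alpha of the estimated noise power and
    keep at least `floor` of the original spectrum, so the suppression
    stays deliberately weak and distorts the speech less."""
    return np.maximum(power_spec - alpha * noise_est, floor * power_spec)

spec = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
noise = np.full_like(spec, spec.mean())   # crude stationary noise estimate
print(weak_spectral_subtraction(spec, noise)[:5])
```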
Citations: 1
Automatic detection and correction of syntax-based prosody annotation errors
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424259
Sandrine Brognaux, Thomas Drugman, Richard Beaufort
Both unit-selection and HMM-based speech synthesis require large annotated speech corpora. To generate more natural speech, considering the prosodic nature of each phoneme of the corpus is crucial. Generally, phonemes are assigned labels which should reflect their suprasegmental characteristics. These labels often result from an automatic syntactic analysis, without checking the acoustic realization of the phoneme in the corpus. This leads to numerous errors, because syntax and prosody do not always coincide. This paper proposes a method to reduce the number of labeling errors using acoustic information. It is applicable as a post-process to any syntax-driven prosody labeling. Acoustic features are considered to check the syntax-based labels and suggest potential modifications. The proposed technique has the advantage of not requiring a manually prosody-labelled corpus. The evaluation on a French corpus shows that more than 75% of the errors detected by the method are actual errors which must be corrected.
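A hedged sketch of the acoustic check: flag syntax-derived prominence labels that the measured acoustics do not support. The features (z-scored F0 and duration) and the threshold are my choices for illustration, not the paper's feature set:

```python
import numpy as np

def flag_label_errors(labels, f0, duration, z_thresh=0.0):
    """labels: 1 = syntax says prominent, 0 = not prominent.
    A 'prominent' phoneme whose F0 and duration both fall below the
    corpus mean is flagged as a likely annotation error (toy criterion)."""
    f0_z = (f0 - f0.mean()) / (f0.std() + 1e-12)
    dur_z = (duration - duration.mean()) / (duration.std() + 1e-12)
    return (labels == 1) & (f0_z < z_thresh) & (dur_z < z_thresh)

labels = np.array([1, 0, 1, 1, 0])
f0 = np.array([220.0, 180.0, 150.0, 240.0, 170.0])    # Hz
dur = np.array([0.12, 0.06, 0.05, 0.14, 0.07])        # seconds
print(flag_label_errors(labels, f0, dur))  # third phoneme looks suspect
```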
Citations: 2
Context-dependent Deep Neural Networks for audio indexing of real-life data
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424212
Gang Li, Huifeng Zhu, G. Cheng, Kit Thambiratnam, Behrooz Chitsaz, Dong Yu, F. Seide
We apply Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to the real-life problem of indexing audio data from various sources. Recently, we showed that on the Switchboard benchmark for speaker-independent transcription of phone calls, CD-DNN-HMMs with 7 hidden layers reduce the word error rate by as much as one-third compared to discriminatively trained Gaussian-mixture HMMs, and by one-fourth if the GMM-HMM also uses fMPE features. This paper takes CD-DNN-HMM based recognition into a real-life deployment for audio indexing. We find that for our best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data (video podcasts and talks). Compared to a speaker-adaptive GMM system, the relative improvement is 18%, at very similar end-to-end runtime. In system building, we find that DNNs can benefit from a larger number of senones than the GMM-HMM, and that DNN likelihood evaluation is a sizeable runtime factor even in our wide-beam context of generating rich lattices: cutting the model size by 60% reduces runtime by one-third at a 5% relative WER loss.
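For context, senone posteriors come from a plain feedforward pass; the toy sketch below (sigmoid hidden layers and placeholder sizes far smaller than the deployed 7-layer, 32k-senone networks) shows the shape of the computation:

```python
import numpy as np

def dnn_posteriors(x, weights, biases):
    """Feedforward DNN acoustic model: sigmoid hidden layers, softmax
    output giving P(senone | acoustic frame). Toy-sized illustration."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))      # sigmoid hidden layer
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())               # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
sizes = [39, 64, 64, 100]   # e.g. 39-dim features -> 100 'senones' (toy)
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
print(dnn_posteriors(rng.normal(size=39), weights, biases).sum())  # -> 1.0
```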
Citations: 10