
Latest publications from 2012 IEEE Spoken Language Technology Workshop (SLT)

Improved semantic retrieval of spoken content by language models enhanced with acoustic similarity graph
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424219
Hung-yi Lee, Tsung-Hsien Wen, Lin-Shan Lee
Retrieving objects semantically related to the query has been widely studied in text information retrieval. However, when applying text-based techniques to spoken content, the inevitable recognition errors may seriously degrade the performance. In this paper, we propose to enhance the expected term frequencies estimated from spoken content by acoustic similarity graphs. For each word in the lexicon, a graph is constructed describing acoustic similarity among spoken segments in the archive. Score propagation over the graph helps in estimating the expected term frequencies. The enhanced expected term frequencies can be used in the language modeling retrieval approach, as well as in semantic retrieval techniques such as document expansion based on latent semantic analysis, and query expansion considering both words and latent topic information. Preliminary experiments performed on Mandarin broadcast news indicated that improved performance was achievable under different conditions.
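The score-propagation step can be sketched as follows. This is a hypothetical minimal illustration, not the paper's implementation: the function name, interpolation weight `alpha`, and iteration count are all assumptions. It smooths one word's expected term frequencies over a row-normalised acoustic similarity graph among spoken segments.

```python
import numpy as np

def propagate_scores(tf, sim, alpha=0.3, iters=10):
    """Smooth expected term frequencies over an acoustic similarity graph.

    tf    : (n,) initial expected term frequencies for one lexicon word,
            one entry per spoken segment in the archive
    sim   : (n, n) pairwise acoustic similarity between segments
    alpha : weight given to scores received from acoustically similar
            neighbours vs. the original recognition-based score
    """
    # Row-normalise so each segment distributes its score to its neighbours.
    W = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    s = tf.astype(float)
    for _ in range(iters):
        # Interpolate the original score with the score flowing in over edges.
        s = (1 - alpha) * tf + alpha * W.T @ s
    return s
```

The intuition this captures: a segment whose recognized count is zero but which is acoustically similar to a high-count segment receives a boosted expected term frequency, partially compensating for recognition errors.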
Citations: 11
Intent transfer in speech-to-speech machine translation
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424214
G. Anumanchipalli, Luís C. Oliveira, A. Black
This paper presents an approach for transfer of speaker intent in speech-to-speech machine translation (S2SMT). Specifically, we describe techniques to retain the prominence patterns of the source language utterance through the translation pipeline and impose this information during speech synthesis in the target language. We first present an analysis of word focus across languages to motivate the problem of transfer. We then propose an approach for training an appropriate transfer function for intonation on a parallel speech corpus in the two languages within which the translation is carried out. We present our analysis and experiments on English↔Portuguese and English↔German language pairs and evaluate the proposed transformation techniques through objective measures.
Citations: 32
Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424220
Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, Lin-Shan Lee
Voice access of cloud applications via smartphones is very attractive today, specifically because a smartphone is used by a single user, so personalized acoustic/language models become feasible. In particular, huge quantities of text with known authors and given relationships are available within the social networks over the Internet, so it is possible to train personalized language models: it is reasonable to assume users with those relationships may share some common subject topics, wording habits and linguistic patterns. In this paper, we propose an adaptation framework for building a robust personalized language model by incorporating the texts the target user and other users had posted on social networks, to take care of the linguistic mismatch across different users. Experiments on a Facebook dataset showed encouraging improvements in terms of both model perplexity and recognition accuracy with the proposed approaches, which consider relationships among users, similarity based on latent topics, and random walk over a user graph.
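The crowd-sourcing idea can be illustrated with a toy unigram interpolation. This sketch is illustrative only: the paper adapts full n-gram language models, and the function name, the `(tokens, weight)` input format, and the interpolation weight `beta` are all assumptions, with relationship strength standing in for the paper's relationship/topic-similarity/random-walk weightings.

```python
from collections import Counter

def personalized_unigram(user_words, friends, beta=0.5):
    """Interpolate a user's own unigram statistics with counts crowd-sourced
    from socially related users.

    user_words : list of tokens from the target user's own posts
    friends    : list of (token_list, weight) pairs, one per related user,
                 where weight encodes relationship strength
    beta       : interpolation weight on the user's own counts
    """
    user = Counter(user_words)
    social = Counter()
    for words, w in friends:
        # Each friend's counts contribute in proportion to the relationship weight.
        for tok, c in Counter(words).items():
            social[tok] += w * c
    vocab = set(user) | set(social)
    nu = max(sum(user.values()), 1)
    ns = max(sum(social.values()), 1)
    return {t: beta * user[t] / nu + (1 - beta) * social[t] / ns
            for t in vocab}
```

Words the target user never typed but that related users use often still receive probability mass, which is the point of borrowing social-network text.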
Citations: 11
On the generalization of Shannon entropy for speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424204
Nicolas Obin, M. Liuni
This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distribution of amplitudes). The proposed representation outperforms all other noisiness measures - including Shannon and Wiener entropies - in a large-scale classification of vocal effort (whispered-soft/normal/loud-shouted) in the real scenario of multi-language massive role-playing video games. The improvement is around 10% in relative error reduction, and is particularly significant for the recognition of noisy speech - i.e., whispery/breathy speech. This confirms the role of noisiness for speech recognition, and will further be extended to the classification of voice quality for the design of an automatic voice casting system in video games.
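The Rényi entropy underlying the proposed representation can be computed as follows. This is a minimal sketch of the standard definition, not the paper's exact spectral front end; the function name and the default order are assumptions. The order parameter lets the measure emphasise the dominant (harmonic) components for large alpha or the flat (noise) floor for small alpha, and alpha → 1 recovers the Shannon entropy.

```python
import numpy as np

def renyi_entropy(spectrum, alpha=2.0, eps=1e-12):
    """Rényi entropy of order `alpha` of a normalised magnitude spectrum,
    in bits. alpha == 1 is handled as the Shannon limit."""
    p = np.asarray(spectrum, dtype=float)
    p = p / max(p.sum(), eps)       # normalise to a probability distribution
    p = np.maximum(p, eps)          # avoid log(0)
    if abs(alpha - 1.0) < 1e-6:
        return float(-np.sum(p * np.log2(p)))             # Shannon entropy
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))
```

A flat (noise-like) spectrum yields high entropy; a peaky (harmonic) spectrum yields low entropy, which is what makes the measure usable as a noisiness feature alongside MFCCs.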
Citations: 17
Lexical entrainment and success in student engineering groups
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424258
Heather Friedberg, D. Litman, Susannah B. F. Paletz
Lexical entrainment is a measure of how the words that speakers use in a conversation become more similar over time. In this paper, we propose a measure of lexical entrainment for multi-party speaking situations. We apply this score to a corpus of student engineering groups using high-frequency words and project words, and investigate the relationship between lexical entrainment and group success on a class project. Our initial findings show that, using the entrainment score with project-related words, there is a significant difference between the lexical entrainment of high performing groups, which tended to increase with time, and the entrainment for low performing groups, which tended to decrease with time.
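One plausible instantiation of such an entrainment score — not necessarily the authors' exact multi-party formula — is the negative L1 distance between two parties' relative frequencies over a target word set (e.g. high-frequency words or project words), so that a score closer to zero means more similar word usage.

```python
from collections import Counter

def entrainment_score(words_a, words_b, target_words):
    """Negative L1 distance between two parties' relative frequencies of a
    target word set. 0 = identical usage; more negative = less entrained."""
    fa, fb = Counter(words_a), Counter(words_b)
    na = max(len(words_a), 1)
    nb = max(len(words_b), 1)
    return -sum(abs(fa[w] / na - fb[w] / nb) for w in target_words)
```

Comparing this score across early and late segments of a conversation gives the kind of over-time trend the paper relates to group performance.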
Citations: 57
Train&align: A new online tool for automatic phonetic alignment
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424260
Sandrine Brognaux, Sophie Roekhaut, Thomas Drugman, Richard Beaufort
Several automatic phonetic alignment tools have been proposed in the literature. They usually rely on pre-trained speaker-independent models to align new corpora. Their drawback is that they cover a very limited number of languages and might not perform properly for different speaking styles. This paper presents a new tool for automatic phonetic alignment available online. Its specificity is that it trains the model directly on the corpus to align, which makes it applicable to any language and speaking style. Experiments on three corpora show that it provides results comparable to other existing tools. It also allows the tuning of some training parameters. The use of tied-state triphones, for example, shows further improvement of about 1.5% for a 20 ms threshold. A manually-aligned part of the corpus can also be used as bootstrap to improve the model quality. Alignment rates were found to significantly increase, up to 20%, using only 30 seconds of bootstrapping data.
Citations: 35
Automatic detection and correction of syntax-based prosody annotation errors
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424259
Sandrine Brognaux, Thomas Drugman, Richard Beaufort
Both unit-selection and HMM-based speech synthesis require large annotated speech corpora. To generate more natural speech, considering the prosodic nature of each phoneme of the corpus is crucial. Generally, phonemes are assigned labels which should reflect their suprasegmental characteristics. Labels often result from an automatic syntactic analysis, without checking the acoustic realization of the phoneme in the corpus. This leads to numerous errors because syntax and prosody do not always coincide. This paper proposes a method to reduce the amount of labeling errors, using acoustic information. It is applicable as a post-process to any syntax-driven prosody labeling. Acoustic features are considered, to check the syntax-based labels and suggest potential modifications. The proposed technique has the advantage of not requiring a manually prosody-labelled corpus. The evaluation on a corpus in French shows that more than 75% of the errors detected by the method are effective errors which must be corrected.
Citations: 2
Topic n-gram count language model adaptation for speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424216
Md. Akmal Haidar, D. O'Shaughnessy
We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft and hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram for all topics is equal to the global count of that n-gram in the training set. Here, the normalized topic weights of the n-gram are multiplied by the global n-gram count to form the topic n-gram count for the respective topics. In hard clustering, each n-gram is assigned to a single topic with the maximum fraction of the global n-gram count for the corresponding topic. Here, the topic is selected using the maximum topic weight for the n-gram. The topic n-gram count LMs are created using the respective topic n-gram counts and adapted by using the topic weights of a development test set. We compute the average of the confidence measures: the probability of word given topic and the probability of topic given word. The average is taken over the words in the n-grams and the development test set to form the topic weights of the n-grams and the development test set respectively. Our approaches show better performance over some traditional approaches using the WSJ corpus.
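The soft and hard count-splitting described above can be sketched directly. This is an illustrative function under assumed inputs (dicts of global counts and per-n-gram topic weights), not the authors' code; it shows the invariant that, per n-gram, the per-topic counts always sum to the global count.

```python
import numpy as np

def split_ngram_counts(global_counts, topic_weights, hard=False):
    """Distribute each n-gram's global training-set count across topics.

    global_counts : dict ngram -> global count in the training set
    topic_weights : dict ngram -> array of (unnormalised) LDA topic weights
    hard          : if True, assign the whole count to the max-weight topic
    Returns dict ngram -> per-topic count vector summing to the global count.
    """
    out = {}
    for ng, c in global_counts.items():
        w = np.asarray(topic_weights[ng], dtype=float)
        w = w / w.sum()                 # normalise topic weights
        if hard:
            counts = np.zeros_like(w)
            counts[np.argmax(w)] = c    # all mass on the dominant topic
        else:
            counts = c * w              # spread mass by topic weight
        out[ng] = counts
    return out
```

The per-topic count vectors would then be used to build the topic n-gram count LMs, which are interpolated with topic weights estimated on the development test set.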
Citations: 18
A noise-robust speech recognition method composed of weak noise suppression and weak Vector Taylor Series Adaptation
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424205
Shuji Komeiji, T. Arakawa, Takafumi Koshinaka
This paper proposes a noise-robust speech recognition method composed of weak noise suppression (NS) and weak Vector Taylor Series Adaptation (VTSA). The proposed method compensates for the defects of NS and VTSA while retaining only their advantages. The weak NS reduces the distortion from over-suppression that may accompany noise-suppressed speech. The weak VTSA avoids over-adaptation by offsetting the part of acoustic-model adaptation that corresponds to the suppressed noise. Evaluation results with the AURORA2 database show that the proposed method achieves word accuracy as much as 1.2 points higher (87.4%) than a method with VTSA alone (86.2%), which in turn is always better than its counterpart with NS.
Citations: 1
Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424211
Cong-Thanh Do, M. Taghizadeh, Philip N. Garner
This paper investigates the combination of cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition. Testing speech signals are recorded by a circular microphone array and are subsequently processed with superdirective beamforming and McCowan post-filtering. Training speech signals, from the MultiChannel Overlapping Numbers Corpus (MONC), are clean and not overlapping. Cochlear implant-like speech processing, which is inspired by the speech processing strategy in cochlear implants, is applied on the training and testing speech signals. Cepstral normalization, including cepstral mean and variance normalization (CMN and CVN), is applied on the training and testing cepstra. Experiments show that implementing either cepstral normalization or cochlear implant-like speech processing helps in reducing the WERs of microphone array-based speech recognition. Combining cepstral normalization and cochlear implant-like speech processing further reduces the WERs when there is overlapping speech. Train/test mismatches are measured using the Kullback-Leibler divergence (KLD) between the global probability density functions (PDFs) of training and testing cepstral vectors. This measure reveals a train/test mismatch reduction when either cepstral normalization or cochlear implant-like speech processing is used. It also reveals that combining these two processing steps further reduces the train/test mismatches as well as the WERs.
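CMN and CVN themselves are standard and easy to sketch per utterance. This is a minimal illustration, not the paper's exact pipeline: each cepstral coefficient track is shifted to zero mean (CMN) and scaled to unit variance (CVN) over the utterance.

```python
import numpy as np

def cmvn(cepstra, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    cepstra : (frames, coeffs) array of cepstral features (e.g. MFCCs)
    Returns features with zero mean and unit variance per coefficient.
    """
    mu = cepstra.mean(axis=0)                 # CMN: remove per-coeff mean
    sigma = cepstra.std(axis=0)               # CVN: normalise per-coeff spread
    return (cepstra - mu) / np.maximum(sigma, eps)
```

Because a constant convolutive channel shifts all cepstra by the same offset, CMN removes much of the channel mismatch between training and testing conditions, which is what the KLD measurements in the paper quantify.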
{"title":"Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition","authors":"Cong-Thanh Do, M. Taghizadeh, Philip N. Garner","doi":"10.1109/SLT.2012.6424211","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424211","url":null,"abstract":"This paper investigates the combination of cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition. Testing speech signals are recorded by a circular microphone array and are subsequently processed with superdirective beamforming and McCowan post-filtering. Training speech signals, from the multichannel overlapping Number corpus (MONC), are clean and not overlapping. Cochlear implant-like speech processing, which is inspired from the speech processing strategy in cochlear implants, is applied on the training and testing speech signals. Cepstral normalization, including cepstral mean and variance normalization (CMN and CVN), are applied on the training and testing cepstra. Experiments show that implementing either cepstral normalization or cochlear implant-like speech processing helps in reducing the WERs of microphone array-based speech recognition. Combining cepstral normalization and cochlear implant-like speech processing reduces further the WERs, when there is overlapping speech. Train/test mismatches are measured using the Kullback-Leibler divergence (KLD), between the global probability density functions (PDFs) of training and testing cepstral vectors. This measure reveals a train/test mismatch reduction when either cepstral normalization or cochlear implant-like speech processing is used. It reveals also that combining these two processing reduces further the train/test mismatches as well as the WERs.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116829370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
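The abstract above describes two measurable quantities: cepstral mean and variance normalization (CMVN) applied per utterance, and a KLD between the global PDFs of training and testing cepstra. The sketch below illustrates both on synthetic data; it assumes diagonal-covariance Gaussian PDFs for the KLD (the paper does not specify the PDF form here) and uses made-up cepstral dimensions and frame counts purely for illustration.

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalization (CMN + CVN):
    subtract the per-dimension mean and divide by the per-dimension
    standard deviation, both computed over the utterance's frames."""
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0) + 1e-8  # guard against zero variance
    return (cepstra - mean) / std

def gaussian_kld(mu_p, var_p, mu_q, var_q):
    """KL divergence D(p || q) between two diagonal-covariance
    Gaussians, summed over the cepstral dimensions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def mismatch(train, test):
    """Fit a diagonal Gaussian to each set of cepstral vectors and
    return the KLD between the two fitted PDFs."""
    return gaussian_kld(train.mean(0), train.var(0), test.mean(0), test.var(0))

# Synthetic stand-ins: 500 frames of 13-dimensional cepstra, with the
# "test" distribution deliberately shifted and scaled (a mismatch).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 13))
test = rng.normal(0.5, 1.5, size=(500, 13))

before = mismatch(train, test)
after = mismatch(cmvn(train), cmvn(test))
print(before, after)  # normalization shrinks the train/test KLD
```

After CMVN, both sets have (per-dimension) zero mean and unit variance, so the fitted Gaussians nearly coincide and the KLD collapses toward zero, which is the qualitative effect the abstract reports for the mismatch measure.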
Citations: 7
Journal: 2012 IEEE Spoken Language Technology Workshop (SLT)