
Latest publications: 2012 IEEE Spoken Language Technology Workshop (SLT)

Topic n-gram count language model adaptation for speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424216
Md. Akmal Haidar, D. O'Shaughnessy
We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft and hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram over all topics is equal to the global count of that n-gram in the training set. Here, the normalized topic weights of the n-gram are multiplied by the global n-gram count to form the topic n-gram count for the respective topics. In hard clustering, each n-gram is assigned to a single topic, which receives the maximum fraction of the global n-gram count. Here, the topic is selected using the maximum topic weight for the n-gram. The topic n-gram count LMs are created using the respective topic n-gram counts and adapted by using the topic weights of a development test set. We compute the average of the confidence measures: the probability of word given topic and the probability of topic given word. The average is taken over the words in the n-grams and the development test set to form the topic weights of the n-grams and the development test set respectively. Our approaches outperform several traditional approaches on the WSJ corpus.
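The soft and hard count assignments described above can be sketched as follows. This is a minimal illustration under the assumption that LDA topic weights per n-gram are already computed; it is not the authors' implementation.

```python
def soft_counts(ngram_count, topic_weights):
    # Soft clustering: distribute the global n-gram count over topics in
    # proportion to the normalized topic weights, so the per-topic counts
    # sum back to the global count.
    total = sum(topic_weights)
    return [ngram_count * w / total for w in topic_weights]

def hard_count(ngram_count, topic_weights):
    # Hard clustering: assign the full global count to the single topic
    # with the maximum topic weight for this n-gram.
    best = max(range(len(topic_weights)), key=lambda k: topic_weights[k])
    counts = [0.0] * len(topic_weights)
    counts[best] = float(ngram_count)
    return counts

# Example: an n-gram observed 10 times, with (unnormalized) topic weights 1:2:1.
print(soft_counts(10, [1, 2, 1]))  # [2.5, 5.0, 2.5]
print(hard_count(10, [1, 2, 1]))   # [0.0, 10.0, 0.0]
```

The per-topic counts from `soft_counts` would then feed a standard count-based LM estimator per topic, with the topic LMs interpolated using the development-set topic weights.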
Citations: 18
Statistical methods for varying the degree of articulation in new HMM-based voices
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424238
B. Picart, Thomas Drugman, T. Dutoit
This paper focuses on the automatic modification of the degree of articulation (hypo/hyperarticulation) of an existing standard neutral voice in the framework of HMM-based speech synthesis. Starting from a source speaker for which neutral, hypo- and hyperarticulated speech data are available, two sets of transformations are computed during the adaptation of the neutral speech synthesizer. These transformations are then applied to a new target speaker for which no hypo/hyperarticulated recordings are available. Four statistical methods are investigated, differing in the speaking style adaptation technique (MLLR vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. This study focuses on the prosody model, although such techniques can be applied to any stream of parameters exhibiting suitable interpolability properties. Two subjective evaluations are performed in order to determine which statistical transformation method achieves the best segmental quality and reproduction of the degree of articulation.
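As a rough illustration of the (C)MLLR-style transformations mentioned above: CMLLR applies an affine transform to the feature vectors (MLLR to the model means). This is a minimal sketch only; the transform matrix and bias are assumed given, whereas in practice they are estimated by maximum likelihood during adaptation.

```python
def apply_affine(vectors, A, b):
    # Apply the affine transform y = A x + b to each vector.
    # A is a square matrix (list of rows), b a bias vector.
    out = []
    for x in vectors:
        y = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
             for row, b_i in zip(A, b)]
        out.append(y)
    return out

# Toy example: identity rotation plus a shift, so each vector moves by b.
feats = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, -0.5]
print(apply_affine(feats, A, b))  # [[1.5, 1.5], [3.5, 3.5]]
```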
Citations: 1
Automatic classification of unequal lexical stress patterns using machine learning algorithms
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424255
M. Shahin, B. Ahmed, K. Ballard
Technology-based speech therapy systems are severely handicapped by the absence of accurate prosodic event identification algorithms. This paper introduces an automatic method for classifying strong-weak (SW) and weak-strong (WS) stress patterns in the speech of children with an American English accent, for use in the assessment of speech dysprosody. We investigate two sets of features for training classifiers to identify the variation in lexical stress between two consecutive syllables. The first set consists of traditional features derived from measurements of pitch, intensity and duration, whereas the second set consists of the energies of different filter banks. Three different classifiers were used in the experiments: an Artificial Neural Network (ANN) with a single hidden layer, a Support Vector Machine (SVM) with both linear and Gaussian kernels, and Maximum Entropy modeling (MaxEnt). The best results were obtained using the ANN classifier and a combination of the two feature sets. The system correctly classified 94% of the SW stress patterns and 76% of the WS stress patterns.
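The first feature set above compares pitch, intensity and duration across the two syllables. A minimal sketch of such pairwise features, with a toy threshold rule standing in for the trained ANN/SVM/MaxEnt classifiers (the field names and the decision rule are illustrative, not from the paper):

```python
import math

def stress_features(syl1, syl2):
    # Log-ratios of pitch, intensity and duration between two consecutive
    # syllables; positive values favour a strong-weak (SW) pattern.
    return [math.log(syl1[k] / syl2[k]) for k in ("pitch", "intensity", "duration")]

def classify_sw_ws(feats):
    # Toy stand-in for a trained classifier: label SW if the summed
    # log-ratios favour the first syllable, WS otherwise.
    return "SW" if sum(feats) > 0 else "WS"

strong = {"pitch": 220.0, "intensity": 70.0, "duration": 0.25}
weak = {"pitch": 180.0, "intensity": 60.0, "duration": 0.15}
print(classify_sw_ws(stress_features(strong, weak)))  # SW
print(classify_sw_ws(stress_features(weak, strong)))  # WS
```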
Citations: 14
Combining multiple translation systems for Spoken Language Understanding portability
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424221
Fernando García, L. Hurtado, E. Segarra, E. Arnal, G. Riccardi
We are interested in the problem of learning Spoken Language Understanding (SLU) models for multiple target languages. Learning such models requires annotated corpora, and porting to different languages would require corpora with parallel text translations and semantic annotations. In this paper we investigate how to learn an SLU model in a target language starting from no target text and no semantic annotation. Our proposed algorithm is based on the idea of exploiting the diversity (with regard to performance and coverage) of multiple translation systems to transfer statistically stable word-to-concept mappings, here for the Romance language pair French-Spanish. Each translation system performs differently at the lexical level (with respect to BLEU). The best performance on the semantic task is obtained by combining the systems at different stages of the portability methodology. We have evaluated the portability algorithms on the French MEDIA corpus, using French as the source language and Spanish as the target language. The experiments show the effectiveness of the proposed methods with respect to the source language SLU baseline.
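One simple way to realize "statistically stable" word-to-concept transfer across several translation systems is agreement voting. This is a hedged sketch of the idea, not the paper's algorithm; the system outputs and concept labels are invented for illustration.

```python
from collections import Counter

def stable_mappings(system_outputs, min_votes=2):
    # Keep only word-to-concept mappings on which at least `min_votes`
    # of the translation systems agree.
    votes = Counter()
    for mapping in system_outputs:
        for pair in mapping.items():
            votes[pair] += 1
    return {w: c for (w, c), n in votes.items() if n >= min_votes}

# Three hypothetical systems mapping Spanish words to MEDIA-style concepts.
sys_a = {"hotel": "lodging", "precio": "price", "cerca": "location"}
sys_b = {"hotel": "lodging", "precio": "price", "cerca": "distance"}
sys_c = {"hotel": "lodging", "precio": "cost"}
print(stable_mappings([sys_a, sys_b, sys_c]))
# {'hotel': 'lodging', 'precio': 'price'}
```

Mappings the systems disagree on (`cerca`) are dropped rather than transferred, trading coverage for precision.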
Citations: 18
Joint language models for automatic speech recognition and understanding
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424222
Ali Orkan Bayer, G. Riccardi
Language models (LMs) are one of the main knowledge sources used by automatic speech recognition (ASR) and Spoken Language Understanding (SLU) systems. In ASR systems they are optimized to decode words from speech for a transcription task. In SLU systems they are optimized to map words into concept constructs or interpretation representations. Performance optimization is generally designed independently for ASR and SLU models in terms of word accuracy and concept accuracy respectively. However, the best word accuracy performance does not always yield the best understanding performance. In this paper we investigate how LMs originally trained to maximize word accuracy can be parametrized to account for speech understanding constraints and maximize concept accuracy. Incremental reduction in concept error rate is observed when a LM is trained on word-to-concept mappings. We show how to optimize the joint transcription and understanding task performance in the lexical-semantic relation space.
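A common way to trade word accuracy against concept accuracy is to score hypotheses with a weighted combination of a word-level and a concept-level LM. The sketch below assumes precomputed log-probabilities and a single interpolation weight; it illustrates the tradeoff, not the paper's parametrization.

```python
def joint_score(word_logprob, concept_logprob, lam=0.5):
    # Log-linear combination of a word-level LM score and a concept-level
    # LM score; lam trades transcription accuracy against understanding.
    return lam * word_logprob + (1.0 - lam) * concept_logprob

def best_hypothesis(hyps, lam=0.5):
    # Pick the ASR hypothesis maximizing the joint score.
    return max(hyps, key=lambda h: joint_score(h["word_lp"], h["concept_lp"], lam))

hyps = [
    {"text": "show flights to boston", "word_lp": -10.0, "concept_lp": -4.0},
    {"text": "show lights to boston", "word_lp": -9.5, "concept_lp": -8.0},
]
# With lam = 0.5 the semantically coherent hypothesis wins despite its
# slightly worse word-level score.
print(best_hypothesis(hyps)["text"])  # show flights to boston
```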
Citations: 10
Syllable-based prosodic analysis of Amharic read speech
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424232
O. Jokisch, Y. Gebremedhin, R. Hoffmann
Amharic is the official language of Ethiopia and belongs to the under-resourced languages. Analyzing a new corpus of Amharic read speech, this contribution surveys syllable-based prosodic variations in f0, duration and intensity to develop suitable prosody models for speech synthesis and recognition. The article starts with a brief description of the Amharic script, the prosodic analysis methods, an accentuation experiment using resynthesis and a perceptual test. The main part summarizes stress-related analysis results for f0, duration and intensity and their interrelations. The quantitative variations of Amharic are comparable with the range in well-examined languages. The observed modifications in syllable duration and mean f0 proved to be relevant for stress perception as demonstrated in the perceptual test with resynthesis stimuli.
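The stress-related analysis above aggregates f0, duration and intensity per syllable. A minimal sketch of such per-group statistics over a labeled syllable list (the field names are illustrative, not from the corpus format):

```python
def syllable_stats(syllables):
    # Mean f0, duration and intensity per syllable, grouped by whether
    # the syllable is stressed.
    grouped = {}
    for syl in syllables:
        g = grouped.setdefault(syl["stressed"], {"f0": [], "dur": [], "int": []})
        g["f0"].append(syl["mean_f0"])
        g["dur"].append(syl["duration"])
        g["int"].append(syl["intensity"])
    mean = lambda xs: sum(xs) / len(xs)
    return {k: {m: mean(v[m]) for m in v} for k, v in grouped.items()}

syls = [
    {"stressed": True, "mean_f0": 210.0, "duration": 0.22, "intensity": 72.0},
    {"stressed": False, "mean_f0": 180.0, "duration": 0.12, "intensity": 64.0},
    {"stressed": True, "mean_f0": 200.0, "duration": 0.20, "intensity": 70.0},
]
stats = syllable_stats(syls)
print(stats[True]["f0"])  # 205.0
```

Comparing the stressed and unstressed rows of `stats` gives exactly the kind of duration and mean-f0 differences the perceptual test probes.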
Citations: 2
Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424213
Yosuke Kashiwagi, Masayuki Suzuki, N. Minematsu, K. Hirose
Multimodal speech recognition is a promising approach to noise-robust automatic speech recognition (ASR) and is currently gathering the attention of many researchers. Multimodal ASR utilizes not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, how the best integration of audio and visual features can be realized is still under discussion. The weights of the audio and visual features should be decided according to the noise characteristics and level: in general, larger weights should be given to the visual features when the noise level is high and vice versa, but how can this be controlled? In this paper, we propose a feature integration method based on piecewise linear transformation. In contrast to other feature integration methods, our proposed method can appropriately change the weights depending on the state of an observed noisy feature, which carries information on both the uttered phonemes and the environmental noise. Experiments on noisy speech recognition are conducted following the CENSREC-1-AV protocol, and an average word error reduction rate of around 24% is achieved compared to a decision fusion method.
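The weighting intuition above can be sketched as a piecewise linear mapping from an estimated noise level to the visual-stream weight. The breakpoints and the SNR-based control variable are illustrative assumptions, not the transformation the paper learns (which operates on the observed feature state itself).

```python
def visual_weight(snr_db, points=((-5.0, 0.9), (5.0, 0.6), (20.0, 0.1))):
    # Piecewise linear mapping from estimated SNR (dB) to the weight of
    # the visual stream: trust the lip features more when the audio is
    # noisy, less when it is clean. Breakpoints are illustrative.
    if snr_db <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if snr_db <= x1:
            t = (snr_db - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[-1][1]

def combine(audio_logprob, visual_logprob, snr_db):
    # Weighted stream combination of audio and visual scores.
    w = visual_weight(snr_db)
    return (1.0 - w) * audio_logprob + w * visual_logprob

print(visual_weight(-10.0))  # 0.9  (very noisy: rely on the visual stream)
print(visual_weight(30.0))   # 0.1  (clean audio: rely on the acoustics)
```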
Citations: 7
The FAU Video Lecture Browser system
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424256
K. Riedhammer, Martin Gropp, E. Nöth
A growing number of universities and other educational institutions provide recordings of lectures and seminars as an additional resource for their students. In contrast to educational films, which are scripted, directed and often shot by film professionals, these plain recordings are typically not post-processed in an editorial sense. Thus, the videos often contain long periods of inactivity or silence, unnecessary repetitions, or corrections of prior mistakes. This paper describes the FAU Video Lecture Browser system, a web-based platform for the interactive assessment of video lectures that helps to close the gap between a plain recording and a useful e-learning resource by displaying automatically extracted and ranked key phrases on an augmented time line based on stream graphs. In a pilot study, users of the interface were able to complete a topic localization task about 29% faster than users provided with the video only, while achieving about the same accuracy. The user interactions can be logged on the server to collect data for evaluating the quality of the phrases and rankings, and for training systems that produce customized phrase rankings.
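A key-phrase ranking over a lecture time line could be sketched as below. The scoring heuristic (frequency weighted by temporal spread) is an invented illustration of the idea; the paper's actual extraction and ranking method is not reproduced here.

```python
def rank_phrases(occurrences, duration):
    # Rank candidate key phrases by frequency, weighted by how much of
    # the lecture time line they span. `occurrences` maps each phrase to
    # a list of timestamps (in seconds).
    scores = {}
    for phrase, times in occurrences.items():
        spread = (max(times) - min(times)) / duration
        scores[phrase] = len(times) * (0.5 + 0.5 * spread)
    return sorted(scores, key=scores.get, reverse=True)

occ = {
    "hidden markov model": [60, 300, 1500, 2700],
    "coffee break": [1800, 1805],
    "viterbi algorithm": [900, 950, 1000],
}
print(rank_phrases(occ, duration=3000)[0])  # hidden markov model
```

A phrase that recurs across the whole lecture outranks one mentioned in a single burst, which matches the browsing use case: the top phrases anchor the augmented time line.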
Citations: 7
What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424229
Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute
This paper presents research on perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16/19 female/male German text-to-speech systems (TTS) and asked listeners to judge the perceptual distances between them in a sorting task. Through a subsequent multidimensional scaling algorithm, we extracted three dimensions. Via expert listening and a comparison to ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions from previous multidimensional analyses. Moreover, the results are analyzed depending on the type of TTS system. The identified dimensions will be used in the future to build a dimension-based quality predictor for synthetic speech.
Citations: 14
Optimization of the DET curve in speaker verification
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424243
Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern
Speaker verification systems are, in essence, statistical pattern detectors that can trade off false rejections for false acceptances. Any operating point characterized by a specific tradeoff between false rejections and false acceptances may be chosen. However, training paradigms in speaker verification systems either learn the parameters of the classifier without actually considering this tradeoff, or optimize the parameters for a particular operating point exemplified by the ratio of positive and negative training instances supplied. In this paper we investigate training paradigms that explicitly consider the tradeoff between false rejections and false acceptances by minimizing the area under the detection error tradeoff (DET) curve. To optimize the parameters, we explicitly minimize a mathematical characterization of the area under the DET curve through generalized probabilistic descent. Experiments on the NIST 2008 database show that for clean signals the proposed optimization approach is at least as effective as conventional learning. On noisy data, verification performance obtained with the proposed approach is considerably better than that obtained with conventional learning methods.
{"title":"Optimization of the DET curve in speaker verification","authors":"Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern","doi":"10.1109/SLT.2012.6424243","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424243","url":null,"abstract":"Speaker verification systems are, in essence, statistical pattern detectors which can trade off false rejections for false acceptances. Any operating point characterized by a specific tradeoff between false rejections and false acceptances may be chosen. Training paradigms in speaker verification systems however either learn the parameters of the classifier employed without actually considering this tradeoff, or optimize the parameters for a particular operating point exemplified by the ratio of positive and negative training instances supplied. In this paper we investigate the optimization of training paradigms to explicitly consider the tradeoff between false rejections and false acceptances, by minimizing the area under the curve of the detection error tradeoff curve. To optimize the parameters, we explicitly minimize a mathematical characterization of the area under the detection error tradeoff curve, through generalized probabilistic descent. Experiments on the NIST 2008 database show that for clean signals the proposed optimization approach is at least as effective as conventional learning. 
On noisy data, verification performance obtained with the proposed approach is considerably better than that obtained with conventional learning methods.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127161673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
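The abstract above minimizes the area under the DET curve during training. The paper's optimization (generalized probabilistic descent on a smoothed characterization of that area) is not reproduced here; the sketch below only shows the quantity being optimized — how a DET curve and its area are computed from raw target and impostor verification scores. All names and the toy score sets are invented for illustration.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Sweep a decision threshold over all scores and return the (FAR, FRR)
    pairs tracing the detection error tradeoff curve."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    all_scores = np.sort(np.concatenate((target_scores, impostor_scores)))
    thresholds = np.concatenate(([-np.inf], all_scores, [np.inf]))
    # False rejection rate: targets scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: impostors scoring at or above the threshold.
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return far, frr

def det_auc(far, frr):
    """Trapezoidal area under the DET curve. As the threshold rises, FAR
    falls and FRR rises, so flip the arrays to get increasing x."""
    x, y = far[::-1], frr[::-1]
    return 0.5 * float(np.sum(np.diff(x) * (y[1:] + y[:-1])))

# Perfectly separated scores: every operating point with FAR = 0, FRR = 0
# is reachable, so the area collapses to zero.
far, frr = det_points([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

A smaller area means a uniformly better set of reachable operating points, which is why it is a natural training objective when no single FAR/FRR tradeoff is fixed in advance.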