
Latest Publications: 2016 IEEE Spoken Language Technology Workshop (SLT)

Improved prediction of the accent gap between speakers of English for individual-based clustering of World Englishes
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846255
Fumiya Shiozawa, D. Saito, N. Minematsu
The term “World Englishes” describes the current state of English, and one of its main characteristics is a large diversity of pronunciation, i.e. accents. In our previous studies, we developed several techniques to realize effective clustering and visualization of this diversity. For this aim, the accent gap between two speakers has to be quantified independently of extra-linguistic factors such as age and gender. To realize this, a unique representation of speech, called speech structure, which is theoretically invariant to these factors, was applied to represent pronunciation. In the current study, we attempt to improve accent gap prediction by controlling the degree of invariance. Two techniques are tested: DNN-based model-free estimation of divergence, and multi-stream speech structures. In the former, instead of estimating the separability between two speech events under particular model assumptions, DNN-based class posteriors are used for the estimation. In the latter, constrained invariance is realized by deriving one speech structure for each sub-space of the acoustic features. Our proposals are evaluated in terms of the correlation between reference accent gaps and the predicted, quantified gaps. Experiments show that the correlation improves from 0.718 to 0.730.
Citations: 1
Automatic optimization of data perturbation distributions for multi-style training in speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846240
Mortaza Doulaty, R. Rose, O. Siohan
Speech recognition performance using deep neural network based acoustic models is known to degrade when the acoustic environment and the speaker population in the target utterances differ significantly from the conditions represented in the training data. To address these mismatched scenarios, multi-style training (MTR) has been used to perturb utterances in an existing uncorrupted, and potentially mismatched, training speech corpus to better match target domain utterances. This paper addresses the problem of determining the distribution of perturbation levels, for a given set of perturbation types, that best matches the target speech utterances. An approach is presented that, given a small set of utterances from a target domain, automatically identifies an empirical distribution of perturbation levels that can be applied to utterances in an existing training set. Distributions are estimated for perturbation types that include acoustic background environments, reverberant room configurations, and speaker-related variation such as frequency and temporal warping. The end goal is for the resulting perturbed training set to characterize the variability in the target domain and thereby optimize ASR performance. An experimental study evaluates the impact of this approach on ASR performance when the target utterances are taken from a simulated far-field acoustic environment.
Citations: 9
Improving multi-stream classification by mapping sequence-embedding in a high dimensional space
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846269
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès
Most Natural and Spoken Language Processing tasks now employ Neural Networks (NNs), allowing them to reach impressive performance. Embedding features allow NLP systems to represent input vectors in a latent space and improve the observed performance. In this context, Recurrent Neural Network (RNN) based architectures such as Long Short-Term Memory (LSTM) are well known for their capacity to encode sequential data into a non-sequential hidden vector representation, called a sequence embedding. In this paper, we propose an LSTM-based multi-stream sequence embedding that encodes parallel sequences into a single non-sequential latent representation vector. We then map this embedding representation into a high-dimensional space using a Support Vector Machine (SVM) in order to classify the multi-stream sequences by finding an optimal hyperplane. Multi-stream sequence embedding allows the SVM classifier to profit more efficiently from the information carried by both parallel streams and longer sequences. The system achieved the best performance in a multi-stream sequence classification task, with a gain of 9 points in error rate compared to an SVM trained on the original input sequences.
Citations: 2
A log-linear weighting approach in the Word2vec space for spoken language understanding
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846289
Killian Janod, Mohamed Morchid, Richard Dufour, G. Linarès
This paper proposes an original method that integrates contextual information about words into Word2vec neural networks, which learn from words and their respective context windows. In the classical word embedding approach, context windows are represented as bags of words, i.e. every word in the context is treated equally. Our model introduces a log-linear weighting approach that models the continuous context, taking into account the relative position of words in the surrounding context of the target word. The quality improvements delivered by this method are shown on the Semantic-Syntactic Word Relationship test and on a real application framework involving a theme identification task on human dialogues. The promising gains of 7 and 5 points obtained by our adapted Word2vec model for the Skip-gram and CBOW approaches, respectively, demonstrate that the proposed models are a step forward for word and document representation.
Citations: 3
Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846336
Florian Hinterleitner, Benjamin Weiss, S. Möller
State-of-the-art approaches to text-to-speech (TTS) synthesis, such as unit selection and HMM synthesis, are data-driven: they use a prerecorded corpus of natural speech to build a voice. This paper investigates the influence of the size of the speech corpus on five perceptual quality dimensions. Six German unit selection voices were created from subsets of different sizes of the same speech corpus using the MaryTTS synthesis platform. Statistical analysis showed a significant influence of corpus size on all five dimensions. Surprisingly, the voice created from the second-largest speech corpus reached the best ratings in almost all dimensions, with its rating on the fluency and intelligibility dimension being significantly higher than that of any other voice. Moreover, we verified a significant effect of the synthesized utterance itself on four of the five perceptual quality dimensions.
Citations: 2
Automatic plagiarism detection for spoken responses in an assessment of English language proficiency
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846254
Xinhao Wang, Keelan Evanini, James V. Bruno, Matthew David Mulholland
This paper addresses the task of automatically detecting plagiarized responses in the context of a test of spoken English proficiency for non-native speakers. Text-to-text content similarity features are used jointly with speaking proficiency features extracted using an automated speech scoring system to train classifiers to distinguish between plagiarized and non-plagiarized spoken responses. A large data set drawn from an operational English proficiency assessment is used to simulate the performance of the detection system in a practical application. The best classifier on this heavily imbalanced data set resulted in an F1-score of 0.706 on the plagiarized class. These results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
Citations: 8
Robust utterance classification using multiple classifiers in the presence of speech recognition errors
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846291
Takeshi Homma, Kazuaki Shima, Takuya Matsumoto
In order to achieve an utterance classifier that not only works robustly against speech recognition errors but also maintains high accuracy for error-free input, we propose the following techniques. First, we propose a classifier training method in which not only error-free transcriptions but also recognized sentences containing errors are used as training data. To maintain high accuracy whether or not the input has recognition errors, we adjust a scaling factor on the number of transcriptions in the training data. Second, we introduce three classifiers that utilize different input features: words, phonemes, and words recovered from phonetic recognition errors. We also introduce a selection method that selects the most probable utterance class from the outputs of the multiple classifiers, using recognition results obtained from both enhanced and non-enhanced speech signals. Experimental results show that our method removes 55% of classification errors for speech recognition input, while the accuracy degradation for transcription input is only 0.7%.
Citations: 2
Automatic turn segmentation for Movie & TV subtitles
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846272
Pierre Lison, R. Meena
Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper presents a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material, although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78% on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
Citations: 23
End-to-End attention based text-dependent speaker verification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846261
Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Y. Gong
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using phonetically discriminative or speaker-discriminative DNNs as feature extractors for speaker verification has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminative CNN to extract noise-robust frame-level features. These features are smartly combined into an utterance-level speaker vector through an attention mechanism. The proposed attention model takes the speaker-discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates the evaluation process exactly: it directly maps a test utterance and a few target speaker utterances into a single verification score. The algorithm can smartly select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 “Hey Cortana” speaker verification task.
Citations: 173
Attribute based shared hidden layers for cross-language knowledge transfer
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846327
Vipul Arora, A. Lahiri, Henning Reetz
Deep neural network (DNN) acoustic models can be adapted to under-resourced languages by transferring the hidden layers. An analogous transfer problem is well known as few-shot learning, where scarcely seen objects are recognised based on their meaningful attributes. In a similar way, this paper proposes a principled way to represent the hidden layers of a DNN in terms of attributes shared across languages. The diverse phoneme sets of different languages can be represented in terms of the phonological features they share. The DNN layers estimating these features can then be transferred in a meaningful and reliable way. Here, we evaluate model transfer from English to German by comparing the proposed method with other popular methods on a phoneme recognition task. Experimental results show that, apart from providing interpretability to DNN acoustic models, the proposed framework provides an efficient means for their speedy adaptation to different languages, even in the face of scanty adaptation data.
Citations: 2