
2016 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

BBN technologies' OpenSAD system
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846238
Scott Novotney, D. Karakos, J. Silovský, R. Schwartz
We describe our submission to the NIST OpenSAD evaluation of speech activity detection of noisy audio generated by the DARPA RATS program. Because of frequent transmission degradation, channel interference and other added noise, simple energy thresholds perform poorly at SAD on this audio. The evaluation measured performance on both in-training and novel channels. Our approach used a system combination of feed-forward neural networks and bidirectional LSTM recurrent neural networks. System combination and unsupervised adaptation provided further gains on novel channels that lack training data. These improvements led to a 26% relative improvement on novel channels over simple decoding. Our system achieved the lowest error rate on the in-training channels and the second lowest on the out-of-training channels.
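As a rough illustration of the score-level system combination the abstract describes, the sketch below fuses per-frame speech posteriors from a feed-forward network and a bidirectional LSTM. It is not the authors' implementation; the feature dimension, layer sizes and the interpolation weight `alpha` are all assumptions.

```python
# A minimal sketch (not the authors' code) of frame-level score fusion
# between a feed-forward network and a bidirectional LSTM for SAD.
import torch
import torch.nn as nn

class FeedForwardSAD(nn.Module):
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # per-frame speech logit
        )

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        return self.net(x).squeeze(-1)

class BLSTMSAD(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)

def combine_scores(ffnn, blstm, x, alpha=0.5):
    """Linear fusion of per-frame speech posteriors; alpha is a
    hypothetical interpolation weight tuned on held-out data."""
    p1 = torch.sigmoid(ffnn(x))
    p2 = torch.sigmoid(blstm(x))
    return alpha * p1 + (1 - alpha) * p2     # (batch, frames)

x = torch.randn(2, 100, 40)                  # 2 utterances, 100 frames
speech_prob = combine_scores(FeedForwardSAD(), BLSTMSAD(), x)
```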
Citations: 2
Automated structure discovery and parameter tuning of neural network language model based on evolution strategy
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846334
Tomohiro Tanaka, Takafumi Moriya, T. Shinozaki, Shinji Watanabe, Takaaki Hori, Kevin Duh
Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation evolution strategy (CMA-ES), which has demonstrated robustness in other black-box hyper-parameter optimization problems. By flexibly allowing optimization of various meta-parameters, including layer-wise unit types, our method automatically finds a configuration that gives improved recognition performance. Further, by using a Pareto-based multi-objective CMA-ES, WER and computational time were reduced jointly: after 10 generations, the relative WER and decoding-time reductions were 4.1% and 22.7% respectively, compared to an initial baseline system whose WER was 8.7%.
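The search loop behind such an approach can be sketched with the open-source `cma` package. This is not the authors' code: the genome layout, the meta-parameter ranges, and the surrogate standing in for actual LM training are illustrative assumptions, and the paper's Pareto-based multi-objective variant goes beyond this single-objective sketch.

```python
# A minimal CMA-ES hyper-parameter search sketch using the `cma` package.
import cma

def clip(v):
    return min(max(v, 0.0), 1.0)

def decode(genome):
    # Map a continuous genome (roughly in [0, 1]) to discrete LM
    # meta-parameters; the ranges here are illustrative.
    return {
        "layers":     1 + int(round(3 * clip(genome[0]))),          # 1..4
        "hidden":     int(128 * 2 ** round(3 * clip(genome[1]))),   # 128..1024
        "dropout":    0.5 * clip(genome[2]),
        "learn_rate": 10 ** (-4 + 3 * clip(genome[3])),             # 1e-4..1e-1
    }

def train_and_eval_lm(cfg):
    # Placeholder for the expensive step (train the LSTM LM, measure
    # validation WER); a synthetic surrogate keeps the sketch runnable.
    return abs(cfg["layers"] - 2) + abs(cfg["dropout"] - 0.2)

def objective(genome):
    return train_and_eval_lm(decode(genome))   # lower is better

es = cma.CMAEvolutionStrategy(x0=[0.5] * 4, sigma0=0.2)
for generation in range(10):                   # 10 generations, as in the paper
    genomes = es.ask()                         # sample candidate configurations
    es.tell(genomes, [objective(g) for g in genomes])
    es.disp()
best_cfg = decode(es.result.xbest)
```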
Citations: 16
Automated optimization of decoder hyper-parameters for online LVCSR
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846303
Akshay Chandrashekaran, Ian Lane
In this paper, we explore the use of automated hyper-parameter optimization techniques with scalarization of multiple objectives to find decoder hyper-parameters suitable for a given acoustic and language model in an LVCSR task. We compare manual optimization, random sampling, the tree of Parzen estimators, Bayesian optimization, and a genetic algorithm to find a technique that yields better performance than manual optimization within a comparable number of hyper-parameter evaluations. We use a scalar combination of word error rate (WER), the log of the real-time factor (logRTF), and peak memory usage, formulated with the augmented Tchebyscheff function (ATF), as the objective function for the automated techniques. For this task, under a constraint on the maximum number of objective evaluations, we find that the best automated technique, Bayesian optimization, outperforms manual optimization by 8% in terms of ATF. We find that memory usage was not a very useful distinguishing factor between different hyper-parameter settings; trade-offs occurred between RTF and WER the majority of the time. We also optimize WER under a hard real-time-factor constraint of 0.1. In this case, constrained Bayesian optimization yields a model that improves by 2.7% over the best model obtained from manual optimization, using 60% of the number of evaluations.
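The augmented Tchebyscheff scalarization named in the abstract can be written compactly as below; the weights, ideal point `z_star` and augmentation constant `rho` shown here are placeholders, not values from the paper.

```python
# A sketch of the augmented Tchebyscheff scalarization of the three
# objectives into the single scalar the optimizers minimize.
def augmented_tchebyscheff(objectives, weights, z_star, rho=0.05):
    """objectives, weights, z_star: equal-length sequences over
    (WER, logRTF, peak memory); returns a single scalar to minimize."""
    terms = [w * abs(f - z) for f, w, z in zip(objectives, weights, z_star)]
    return max(terms) + rho * sum(terms)

# Example: scalarizing WER = 8.0 %, logRTF = -1.0, memory = 2.1 GB
# against an assumed ideal point.
score = augmented_tchebyscheff(
    objectives=[8.0, -1.0, 2.1],
    weights=[0.6, 0.3, 0.1],
    z_star=[5.0, -2.3, 1.0],
)
print(round(score, 3))
```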
Citations: 7
Parallel Long Short-Term Memory for multi-stream classification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846268
Mohamed Bouaziz, Mohamed Morchid, Richard Dufour, G. Linarès, R. Mori
Recently, machine learning research has produced a broad spectrum of original and efficient Deep Neural Network (DNN) algorithms for automatically predicting an outcome from a sequence of inputs. Recurrent hidden cells allow such models, including Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, to manage long-term dependencies. Nevertheless, these RNNs process a single input stream in one (LSTM) or two (bidirectional LSTM) directions. Much of the information available nowadays, however, comes from multiple streams or multimedia documents, which requires RNNs to process these streams synchronously during training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that processes multiple parallel synchronized input sequences in order to predict a common output. The proposed PLSTM method can be used for parallel sequence classification. The PLSTM approach is evaluated on an automatic telecast genre classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms both baseline n-gram models and the state-of-the-art LSTM approach.
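A minimal PyTorch sketch of the idea, assuming one LSTM branch per synchronized stream whose final hidden states are concatenated for a single prediction (the paper's exact merging scheme may differ):

```python
# A sketch (not the authors' code) of a Parallel LSTM: one LSTM per
# synchronized input stream, hidden states merged for one prediction.
import torch
import torch.nn as nn

class ParallelLSTM(nn.Module):
    def __init__(self, stream_dims, hidden=64, n_classes=10):
        super().__init__()
        self.lstms = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True) for d in stream_dims)
        self.out = nn.Linear(hidden * len(stream_dims), n_classes)

    def forward(self, streams):
        # streams: list of (batch, time, dim_i) tensors, one per stream,
        # all sharing the same time axis (synchronized).
        finals = []
        for lstm, x in zip(self.lstms, streams):
            _, (h_n, _) = lstm(x)          # h_n: (1, batch, hidden)
            finals.append(h_n[-1])
        return self.out(torch.cat(finals, dim=-1))

# Three synchronized 13-dim streams of 50 frames, batch of 4.
model = ParallelLSTM(stream_dims=[13, 13, 13], n_classes=8)
logits = model([torch.randn(4, 50, 13) for _ in range(3)])
```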
Citations: 11
Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846336
Florian Hinterleitner, Benjamin Weiss, S. Möller
State-of-the-art approaches to text-to-speech (TTS) synthesis, such as unit selection and HMM synthesis, are data-driven: they build a voice from a prerecorded corpus of natural speech. This paper investigates the influence of the size of the speech corpus on five perceptual quality dimensions. Six German unit selection voices were created from subsets of different sizes of the same speech corpus using the MaryTTS synthesis platform. Statistical analysis showed a significant influence of corpus size on all five dimensions. Surprisingly, the voice created from the second largest speech corpus reached the best ratings in almost all dimensions, with its rating on the fluency-and-intelligibility dimension significantly higher than that of any other voice. Moreover, we also verified a significant effect of the synthesized utterance itself on four of the five perceptual quality dimensions.
Citations: 2
Automatic plagiarism detection for spoken responses in an assessment of English language proficiency
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846254
Xinhao Wang, Keelan Evanini, James V. Bruno, Matthew David Mulholland
This paper addresses the task of automatically detecting plagiarized responses in the context of a test of spoken English proficiency for non-native speakers. Text-to-text content similarity features are used jointly with speaking proficiency features extracted using an automated speech scoring system to train classifiers to distinguish between plagiarized and non-plagiarized spoken responses. A large data set drawn from an operational English proficiency assessment is used to simulate the performance of the detection system in a practical application. The best classifier on this heavily imbalanced data set resulted in an F1-score of 0.706 on the plagiarized class. These results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
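A hedged sketch of the two feature families the abstract combines, using TF-IDF cosine similarity as one plausible text-to-text similarity feature alongside an externally supplied proficiency score; the feature set, data and classifier choice here are illustrative, not the paper's.

```python
# A sketch of combining text similarity and proficiency features to
# train a plagiarism classifier (scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def similarity_feature(response, sources, vectorizer):
    """Max cosine similarity between a response transcript and a pool
    of known source texts (a simplification of the paper's features)."""
    m = vectorizer.transform([response] + sources)
    return cosine_similarity(m[0], m[1:]).max()

sources = ["text of a known online source ...", "another source ..."]
responses = ["a spoken response transcript ...", "an original answer ..."]
labels = np.array([1, 0])                 # 1 = plagiarized
proficiency = np.array([[3.2], [2.8]])    # from a scoring system (assumed)

vec = TfidfVectorizer().fit(sources + responses)
sims = np.array([[similarity_feature(r, sources, vec)] for r in responses])
X = np.hstack([sims, proficiency])
# class_weight="balanced" is one way to handle the heavy imbalance
# the abstract mentions.
clf = LogisticRegression(class_weight="balanced").fit(X, labels)
```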
Citations: 8
Robust utterance classification using multiple classifiers in the presence of speech recognition errors
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846291
Takeshi Homma, Kazuaki Shima, Takuya Matsumoto
To achieve an utterance classifier that not only works robustly against speech recognition errors but also maintains high accuracy for error-free input, we propose the following techniques. First, we propose a classifier training method that uses not only error-free transcriptions but also recognized sentences containing errors as training data. To maintain high accuracy whether or not the input has recognition errors, we adjusted a scaling factor on the number of transcriptions in the training data. Second, we introduced three classifiers that use different input features: words, phonemes, and words recovered from phonetic recognition errors. We also introduced a selection method that picks the most probable utterance class from the outputs of the multiple classifiers, using recognition results obtained from both enhanced and non-enhanced speech signals. Experimental results showed that our method cuts 55% of classification errors for speech recognition input, while the accuracy degradation for transcription input is only 0.7%.
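One plausible reading of the selection step, sketched below: each classifier produces class posteriors for its own input variant, and the class with the highest posterior across all of them is selected (the paper's actual criterion may be more elaborate).

```python
# A simplified sketch of selecting the most probable utterance class
# across several classifiers and recognition variants.
def select_utterance_class(classifier_outputs):
    """classifier_outputs: list of dicts mapping class -> posterior,
    one dict per (classifier, recognition result) pair."""
    best_class, best_score = None, float("-inf")
    for posteriors in classifier_outputs:
        for label, score in posteriors.items():
            if score > best_score:
                best_class, best_score = label, score
    return best_class

outputs = [
    {"navigate": 0.71, "call": 0.29},   # word classifier, enhanced audio
    {"navigate": 0.55, "call": 0.45},   # phoneme classifier
    {"navigate": 0.48, "call": 0.52},   # recovered-word classifier
]
print(select_utterance_class(outputs))  # -> "navigate"
```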
Citations: 2
Automatic turn segmentation for Movie & TV subtitles
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846272
Pierre Lison, R. Meena
Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper presents a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material, although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve classification performance. The classifier achieves an accuracy of 78% on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
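A hedged sketch of the pairwise formulation: a classifier is trained on features of two consecutive subtitle sentences, such as the timing gap and simple lexical cues. The feature set and toy data below are illustrative, not the paper's exact ones.

```python
# A sketch of pairwise turn-boundary classification over consecutive
# subtitle sentences (scikit-learn).
from sklearn.ensemble import RandomForestClassifier

def pair_features(s1, s2):
    gap = s2["start"] - s1["end"]           # timing gap in seconds
    return [
        gap,
        1.0 if s1["text"].rstrip().endswith(("?", "!", ".")) else 0.0,
        1.0 if s2["text"].lstrip().startswith("-") else 0.0,  # dash cue
        len(s1["text"].split()),            # length of first sentence
    ]

pairs = [
    ({"text": "Where were you?", "end": 12.0},
     {"text": "- At home.", "start": 12.4}),
    ({"text": "I went to the", "end": 20.0},
     {"text": "market this morning.", "start": 20.1}),
]
labels = [1, 0]   # 1 = a new turn starts at the second sentence
X = [pair_features(a, b) for a, b in pairs]
clf = RandomForestClassifier().fit(X, labels)
```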
Citations: 23
End-to-End attention based text-dependent speaker verification
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846261
Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, Y. Gong
This paper presents a new type of end-to-end system for text-dependent speaker verification. Previous work using phonetically or speaker-discriminative DNNs as feature extractors for speaker verification has shown promising results. The extracted frame-level features (bottleneck, posterior or d-vector) are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminative CNN to extract noise-robust frame-level features. These features are combined into an utterance-level speaker vector through an attention mechanism. The proposed attention model uses both speaker-discriminative and phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm exactly imitates the evaluation process, directly mapping a test utterance and a few target-speaker utterances into a single verification score, and it automatically selects the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.
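The attention pooling at the core of this idea can be sketched as follows; this is not the authors' system, and the scoring head, feature dimension and cosine back-end are assumptions.

```python
# A sketch of attention pooling: frame-level features are combined
# with learned weights instead of the equal weighting used in earlier
# d-vector averaging.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # per-frame relevance

    def forward(self, frames):                # (batch, time, feat_dim)
        w = torch.softmax(self.score(frames), dim=1)  # attention weights
        return (w * frames).sum(dim=1)        # utterance-level vector

def verification_score(pool, test_frames, enroll_frames):
    """Cosine score between attended utterance vectors; in the paper
    the whole pipeline is trained end-to-end on such a score."""
    e1 = pool(test_frames)
    e2 = pool(enroll_frames)
    return nn.functional.cosine_similarity(e1, e2)

pool = AttentivePooling()
score = verification_score(pool,
                           torch.randn(1, 120, 256),
                           torch.randn(1, 150, 256))
```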
Citations: 173
Attribute based shared hidden layers for cross-language knowledge transfer
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846327
Vipul Arora, A. Lahiri, Henning Reetz
Deep neural network (DNN) acoustic models can be adapted to under-resourced languages by transferring their hidden layers. An analogous transfer problem is well known in few-shot learning, where rarely seen objects are recognised from their meaningful attributes. In a similar way, this paper proposes a principled way to represent the hidden layers of a DNN in terms of attributes shared across languages. The diverse phoneme sets of different languages can be represented in terms of the phonological features they share. The DNN layers estimating these features can then be transferred in a meaningful and reliable way. Here, we evaluate model transfer from English to German by comparing the proposed method with other popular methods on a phoneme recognition task. Experimental results show that, apart from making DNN acoustic models interpretable, the proposed framework provides an efficient means of quickly adapting them to different languages, even in the face of scanty adaptation data.
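Generic hidden-layer transfer, the starting point this abstract builds on, can be sketched as below; the attribute-based variant would specifically keep the layers that estimate shared phonological features. Layer sizes and phone-set sizes here are illustrative assumptions.

```python
# A sketch of cross-language hidden-layer transfer: layers trained on
# the source language (English) are copied and frozen, and only a new
# target-language (German) output layer is trained.
import copy
import torch.nn as nn

def build_dnn(in_dim, hidden, n_out):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_out),
    )

source = build_dnn(in_dim=40, hidden=512, n_out=44)  # English phone set
# ... `source` would be trained on English data here ...

target = copy.deepcopy(source)
target[-1] = nn.Linear(512, 39)        # new German phoneme output layer
for layer in list(target)[:-1]:        # freeze the transferred layers
    for p in layer.parameters():
        p.requires_grad = False
# Only target[-1] is now trained on the (scanty) German adaptation data.
```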
Citations: 2