The MGB-2 challenge: Arabic multi-dialect broadcast media recognition
Pub Date: 2016-09-19 | DOI: 10.1109/SLT.2016.7846277
Ahmed M. Ali, P. Bell, James R. Glass, Yacine Messaoui, Hamdy Mubarak, S. Renals, Yifan Zhang
This paper describes the Arabic Multi-Genre Broadcast (MGB-2) Challenge for SLT-2016. Unlike last year's English MGB Challenge, which focused on the recognition of diverse TV genres, this year's challenge emphasises handling dialect diversity in Arabic speech. Audio data comes from 19 distinct programmes broadcast on the Aljazeera Arabic TV channel between March 2005 and December 2015. Programmes are split into three groups: conversations, interviews, and reports. A total of 1,200 hours have been released with lightly supervised transcriptions for acoustic modelling. For language modelling, we made available over 110M words crawled from the Aljazeera Arabic website, Aljazeera.net, covering the period 2000-2011. Two lexicons have been provided, one phoneme based and one grapheme based. Finally, two tasks were proposed for this year's challenge: standard speech transcription and word alignment. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.
{"title":"The MGB-2 challenge: Arabic multi-dialect broadcast media recognition","authors":"Ahmed M. Ali, P. Bell, James R. Glass, Yacine Messaoui, Hamdy Mubarak, S. Renals, Yifan Zhang","doi":"10.1109/SLT.2016.7846277","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846277","url":null,"abstract":"This paper describes the Arabic Multi-Genre Broadcast (MGB-2) Challenge for SLT-2016. Unlike last year's English MGB Challenge, which focused on recognition of diverse TV genres, this year, the challenge has an emphasis on handling the diversity in dialect in Arabic speech. Audio data comes from 19 distinct programmes from the Aljazeera Arabic TV channel between March 2005 and December 2015. Programmes are split into three groups: conversations, interviews, and reports. A total of 1,200 hours have been released with lightly supervised transcriptions for the acoustic modelling. For language modelling, we made available over 110M words crawled from Aljazeera Arabic website Aljazeera.net for a 10 year duration 2000−2011. Two lexicons have been provided, one phoneme based and one grapheme based. Finally, two tasks were proposed for this year's challenge: standard speech transcription, and word alignment. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127086481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification
Pub Date: 2016-09-16 | DOI: 10.1109/SLT.2016.7846281
Morten Kolbæk, Z. Tan, J. Jensen
In this paper we propose to use a state-of-the-art Deep Recurrent Neural Network (DRNN) based Speech Enhancement (SE) algorithm for noise-robust Speaker Verification (SV). Specifically, we study the performance of an i-vector based SV system when tested in noisy conditions using a DRNN-based SE front-end built on a Long Short-Term Memory (LSTM) architecture. We compare against systems using a Non-negative Matrix Factorization (NMF) based front-end and a Short-Time Spectral Amplitude Minimum Mean Square Error (STSA-MMSE) based front-end. Simulation experiments show that a male-speaker, text-independent DRNN-based SE front-end with no specific a priori knowledge of the noise type outperforms both a text-, noise-type- and speaker-dependent NMF-based front-end and an STSA-MMSE-based front-end in terms of Equal Error Rate across a wide range of noise types and signal-to-noise ratios on the RSR2015 speech corpus.
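As a rough illustration of this kind of enhancement front-end, the sketch below defines a small LSTM that maps noisy log-magnitude spectra to a per-bin gain and returns enhanced magnitudes for downstream feature extraction. The layer sizes, the sigmoid-mask formulation and the feature dimensions are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LSTMEnhancementFrontEnd(nn.Module):
    """Minimal sketch of an LSTM-based speech-enhancement front-end.

    Maps noisy log-magnitude spectra of shape (batch, frames, bins) to enhanced
    magnitudes via a sigmoid gain per time-frequency bin. All sizes are
    illustrative assumptions, not the configuration used in the paper.
    """

    def __init__(self, n_bins: int = 257, hidden: int = 256, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy_log_mag: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(noisy_log_mag)
        gain = torch.sigmoid(self.proj(h))   # per-bin gain in [0, 1]
        return gain * noisy_log_mag.exp()    # enhanced magnitude spectrum

# Example: enhance a batch of 3 utterances of 100 frames each.
model = LSTMEnhancementFrontEnd()
enhanced = model(torch.randn(3, 100, 257))
```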
{"title":"Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification","authors":"Morten Kolbæk, Z. Tan, J. Jensen","doi":"10.1109/SLT.2016.7846281","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846281","url":null,"abstract":"In this paper we propose to use a state-of-the-art Deep Recurrent Neural Network (DRNN) based Speech Enhancement (SE) algorithm for noise robust Speaker Verification (SV). Specifically, we study the performance of an i-vector based SV system, when tested in noisy conditions using a DRNN based SE front-end utilizing a Long Short-Term Memory (LSTM) architecture. We make comparisons to systems using a Non-negative Matrix Factorization (NMF) based front-end, and a Short-Time Spectral Amplitude Minimum Mean Square Error (STSA-MMSE) based front-end, respectively. We show in simulation experiments that a male-speaker and text-independent DRNN based SE front-end, without specific a priori knowledge about the noise type outperforms a text, noise type and speaker dependent NMF based front-end as well as a STSA-MMSE based front-end in terms of Equal Error Rates for a large range of noise types and signal to noise ratios on the RSR2015 speech corpus.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127298830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approaches for language identification in mismatched environments
Pub Date: 2016-09-08 | DOI: 10.1109/SLT.2016.7846286
S. Nercessian, P. Torres-Carrasquillo, Gabriel Martinez-Montes
In this paper, we consider the task of language identification under mismatched conditions. Specifically, we address the use of unlabeled data from the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes both conversational telephone speech and narrowband broadcast speech. Multiple experiments assess the performance of the system under this mismatch and evaluate a number of alternatives for mitigating the drop in performance. The best system evaluated combines deep neural network (DNN) bottleneck features with i-vectors and all of the approaches proposed in this work, improving on the baseline DNN system's performance by 30%.
{"title":"Approaches for language identification in mismatched environments","authors":"S. Nercessian, P. Torres-Carrasquillo, Gabriel Martinez-Montes","doi":"10.1109/SLT.2016.7846286","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846286","url":null,"abstract":"In this paper, we consider the task of language identification in the context of mismatch conditions. Specifically, we address the issue of using unlabeled data in the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes data in both conversational telephone speech and narrowband broadcast speech. Multiple experiments are conducted to assess the performance of the system in this condition and a number of alternatives to ameliorate the drop in performance. The best system evaluated is based on deep neural network (DNN) bottleneck features using i-vectors utilizing a combination of all the approaches proposed in this work. The resulting system improved baseline DNN system performance by 30%.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125245569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical attention model for improved machine comprehension of spoken content
Pub Date: 2016-08-28 | DOI: 10.1109/SLT.2016.7846270
Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee
Multimedia and spoken content carry richer information than plain text, but they are more difficult to display on a screen and for a user to select from. As a result, accessing large collections of such content is much more difficult and time-consuming for humans than accessing text. It is therefore highly attractive to develop machines that can automatically understand spoken content and summarize the key information for humans to browse. In this endeavor, a new task of machine comprehension of spoken content was recently proposed. The initial goal was defined by the TOEFL listening comprehension test, a challenging academic English examination for learners whose native language is not English. An Attention-based Multi-hop Recurrent Neural Network (AMRNN) architecture was also proposed for this task, but it considered only the sequential relationships within the speech utterances. In this paper, we propose a new Hierarchical Attention Model (HAM), which constructs a multi-hop attention mechanism over tree-structured rather than sequential representations of the utterances. The model achieves improved comprehension performance that is robust to ASR errors.
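To make the attention step concrete, the sketch below performs a single attention hop over a set of tree-node embeddings, scoring each node against a question vector and returning a weighted summary. The dot-product scoring and the dimensions are illustrative assumptions; the tree construction and the paper's exact scoring function are not reproduced here.

```python
import numpy as np

def attend_over_nodes(question: np.ndarray, node_vecs: np.ndarray) -> np.ndarray:
    """Single attention hop over tree-node embeddings.

    question:  (d,) query vector for the current hop.
    node_vecs: (n_nodes, d) embeddings of tree nodes (e.g. phrases or sub-trees).
    Returns a (d,) weighted summary of the nodes. Dot-product scoring is an
    illustrative assumption; the paper's scoring function may differ.
    """
    scores = node_vecs @ question                 # (n_nodes,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention weights
    return weights @ node_vecs                    # attended summary vector

# A multi-hop loop would feed the summary back into the next hop's query.
q = np.random.randn(64)
nodes = np.random.randn(10, 64)
summary = attend_over_nodes(q, nodes)
```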
{"title":"Hierarchical attention model for improved machine comprehension of spoken content","authors":"Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee","doi":"10.1109/SLT.2016.7846270","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846270","url":null,"abstract":"Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and be selected by a user. As a result, accessing large collections of the former is much more difficult and time-consuming than the latter for humans. It's therefore highly attractive to develop machines which can automatically understand spoken content and summarize the key information for humans to browse over. In this endeavor, a new task of machine comprehension of spoken content was proposed recently. The initial goal was defined as the listening comprehension test of TOEFL, a challenging academic English examination for English learners whose native languages are not English. An Attention-based Multi-hop Recurrent Neural Network (AMRNN) architecture was also proposed for this task, which considered only the sequential relationship within the speech utterances. In this paper, we propose a new Hierarchical Attention Model (HAM), which constructs multi-hopped attention mechanism over tree-structured rather than sequential representations for the utterances. Improved comprehension performance robust with respect to ASR errors were obtained.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Median-based generation of synthetic speech durations using a non-parametric approach
Pub Date: 2016-08-22 | DOI: 10.1109/SLT.2016.7846337
S. Ronanki, O. Watts, Simon King, G. Henter
This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
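One way to read a median duration off the frame-level outputs is sketched below: the per-frame transition probabilities are chained into a duration distribution and its median is taken. The chained-survival construction is an assumption made for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def median_duration(transition_probs: np.ndarray) -> int:
    """Median phone duration (in frames) implied by per-frame transition probabilities.

    transition_probs[t] is the model's probability that the phone ends at frame t.
    Assumes the duration distribution P(d = t) = p_t * prod_{s < t} (1 - p_s),
    i.e. a chained survival construction (an illustrative assumption).
    Returns the smallest duration whose cumulative probability reaches 0.5.
    """
    p = np.asarray(transition_probs, dtype=float)
    survival = np.cumprod(np.concatenate(([1.0], 1.0 - p[:-1])))  # prob. the phone is still ongoing
    pmf = survival * p                                            # P(d = t + 1) for frame index t
    cdf = np.cumsum(pmf)
    return int(np.searchsorted(cdf, 0.5) + 1)                     # duration in frames (1-based)

print(median_duration([0.05, 0.1, 0.2, 0.4, 0.6, 0.8]))  # -> 4
```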
{"title":"Median-based generation of synthetic speech durations using a non-parametric approach","authors":"S. Ronanki, O. Watts, Simon King, G. Henter","doi":"10.1109/SLT.2016.7846337","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846337","url":null,"abstract":"This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115699699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-lingual deep neural networks for language recognition
Pub Date: 2016-08-08 | DOI: 10.1109/SLT.2016.7846285
Luis Murphy Marcos, F. Richardson
Multi-lingual feature extraction using bottleneck layers in deep neural networks (BN-DNNs) has proven to be an effective technique for low-resource speech recognition and, more recently, for language recognition. In this work we investigate how the multi-lingual BN-DNN architecture and training configuration affect language recognition performance on the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best-performing multi-lingual BN-DNN configuration yields relative performance gains of 50% on LRE11 and 40% on LRE15 compared to a standard MFCC/SDC baseline system, and 17% on LRE11 and 7% on LRE15 relative to a single-language BN-DNN system. A detailed performance analysis using data from all 24 Babel languages, Fisher Spanish and Switchboard English shows the impact of language selection and the amount of training data on overall BN-DNN performance.
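For readers unfamiliar with bottleneck features, the sketch below shows the general idea: a DNN trained on frame-level targets contains one deliberately narrow hidden layer, and the activations of that layer are extracted as features for the downstream recognizer. All layer sizes and target counts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Sketch of a bottleneck DNN acoustic model (sizes are illustrative assumptions)."""

    def __init__(self, n_feats: int = 440, n_targets: int = 3000, bn_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_feats, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bn_dim),                  # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bn_dim, n_targets)   # frame-level (e.g. multi-lingual) targets
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.encoder(x))       # used only while training the DNN

    def bottleneck_features(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)                         # features fed to the language recognizer

# Example: extract bottleneck features for 100 stacked-frame inputs.
dnn = BottleneckDNN()
feats = dnn.bottleneck_features(torch.randn(100, 440))
```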
{"title":"Multi-lingual deep neural networks for language recognition","authors":"Luis Murphy Marcos, F. Richardson","doi":"10.1109/SLT.2016.7846285","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846285","url":null,"abstract":"Multi-lingual feature extraction using bottleneck layers in deep neural networks (BN-DNNs) has been proven to be an effective technique for low resource speech recognition and more recently for language recognition. In this work we investigate the impact on language recognition performance of the multi-lingual BN-DNN architecture and training configurations for the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best performing multi-lingual BN-DNN configuration yields relative performance gains of 50% on LRE11 and 40% on LRE15 compared to a standard MFCC/SDC baseline system and 17% on LRE11 and 7% on LRE15 relative to a single language BN-DNN system. Detailed performance analysis using data from all 24 Babel languages, Fisher Spanish and Switchboard English shows the impact of language selection and the amount of training data on overall BN-DNN performance.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128722775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequence training and adaptation of highway deep neural networks
Pub Date: 2016-07-07 | DOI: 10.1109/SLT.2016.7846304
Liang Lu
The highway deep neural network (HDNN) is a type of depth-gated feedforward neural network which has been shown to be easier to train with more hidden layers and to generalise better than conventional plain deep neural networks (DNNs). Previously, we investigated a structured HDNN architecture for speech recognition in which the two gate functions were tied across all the hidden layers, and we were able to train a much smaller model without sacrificing recognition accuracy. In this paper, we continue the study of this architecture with a sequence-discriminative training criterion and speaker adaptation techniques on the AMI meeting speech recognition corpus. We show that these two techniques improve speech recognition accuracy on top of the model trained with the cross-entropy criterion. Furthermore, we demonstrate that the two gate functions that are tied across all the hidden layers are able to control the information flow over the whole network, and we can achieve considerable improvements by updating only these gate functions in both sequence training and adaptation experiments.
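A minimal sketch of the tied-gate idea follows: each hidden layer has its own transform, but a single pair of gate networks is shared by every layer and mixes the transformed and carried signals. The exact gating formulation and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TiedGateHighwayDNN(nn.Module):
    """Sketch of a highway DNN whose transform/carry gates are shared across layers.

    y_l = T(x_l) * H_l(x_l) + C(x_l) * x_l, where the gate networks T and C use a
    single set of parameters for every hidden layer (the per-layer transforms H_l
    are not shared). Sizes and activations are illustrative assumptions.
    """

    def __init__(self, dim: int = 512, n_layers: int = 10):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.transform_gate = nn.Linear(dim, dim)   # T: tied across all layers
        self.carry_gate = nn.Linear(dim, dim)       # C: tied across all layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for H in self.transforms:
            t = torch.sigmoid(self.transform_gate(x))
            c = torch.sigmoid(self.carry_gate(x))
            x = t * torch.relu(H(x)) + c * x        # gated mix of transform and carry paths
        return x

out = TiedGateHighwayDNN()(torch.randn(8, 512))
```

Because only the two gate networks need updating during adaptation, the number of speaker-specific parameters stays small, which is consistent with the gains reported above.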
{"title":"Sequence training and adaptation of highway deep neural networks","authors":"Liang Lu","doi":"10.1109/SLT.2016.7846304","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846304","url":null,"abstract":"Highway deep neural network (HDNN) is a type of depth-gated feedforward neural network, which has shown to be easier to train with more hidden layers and also generalise better compared to conventional plain deep neural networks (DNNs). Previously, we investigated a structured HDNN architecture for speech recognition, in which the two gate functions were tied across all the hidden layers, and we were able to train a much smaller model without sacrificing the recognition accuracy. In this paper, we carry on the study of this architecture with sequence-discriminative training criterion and speaker adaptation techniques on the AMI meeting speech recognition corpus. We show that these two techniques improve speech recognition accuracy on top of the model trained with the cross entropy criterion. Furthermore, we demonstrate that the two gate functions that are tied across all the hidden layers are able to control the information flow over the whole network, and we can achieve considerable improvements by only updating these gate functions in both sequence training and adaptation experiments.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130457895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DialPort: Connecting the spoken dialog research community to real user data
Pub Date: 2016-06-08 | DOI: 10.1109/SLT.2016.7846249
Tiancheng Zhao, Kyusong Lee, M. Eskénazi
This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog academic research community and gives them access to real users. We introduce a distributed, multi-modal, multi-agent prototype dialog framework that affords easy integration with various remote resources, ranging from end-to-end dialog systems to external knowledge APIs. The portal provides seamless passage from one spoken dialog system to another. To date, the DialPort portal has successfully connected to the multi-domain spoken dialog system at Cambridge University, the NOAA (National Oceanic and Atmospheric Administration) weather API and the Yelp API. We present statistics derived from log data gathered during preliminary tests, covering the performance of the portal and the quality (seamlessness) of the transitions from one system to another.
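As a toy sketch of the routing idea only, the snippet below keeps a registry of connected systems and hands each user turn to whichever system owns the active domain. The registry, the domain names and the routing rule are hypothetical simplifications; the actual framework is distributed, multi-modal and far richer.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a domain name to a remote system's response function
# (in a real portal these would be calls to remote dialog systems or web APIs).
remote_systems: Dict[str, Callable[[str], str]] = {
    "weather": lambda turn: f"[weather system] forecast request: {turn}",
    "restaurants": lambda turn: f"[restaurant system] suggestion request: {turn}",
}

def route_turn(user_turn: str, active_domain: str) -> str:
    """Hand the user's turn to the system that owns the active domain.

    A seamless portal would also manage handing the dialog back when the
    remote system finishes; that bookkeeping is omitted in this sketch.
    """
    system = remote_systems.get(active_domain)
    if system is None:
        return "[portal] Sorry, no connected system handles that yet."
    return system(user_turn)

print(route_turn("Is it raining in Pittsburgh?", "weather"))
```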
{"title":"DialPort: Connecting the spoken dialog research community to real user data","authors":"Tiancheng Zhao, Kyusong Lee, M. Eskénazi","doi":"10.1109/SLT.2016.7846249","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846249","url":null,"abstract":"This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog academic research community and gives them access to real users. We introduce a distributed, multi-modal, multi-agent prototype dialog framework that affords easy integration with various remote resources, ranging from end-to-end dialog systems to external knowledge APIs. The portal provides seamless passage from one spoken dialog system to another. To date, the DialPort portal has successfully connected to the multi-domain spoken dialog system at Cambridge University, the NOAA (National Oceanic and Atmospheric Administration) weather API and the Yelp API. We present statistics derived from log data gathered during preliminary tests of the portal on the performance of the portal and on the quality (seamlessness) of the transition from one system to another.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116841695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural network driven mixture of PLDA for robust i-vector speaker verification
DOI: 10.1109/SLT.2016.7846263
N. Li, M. Mak, Jen-Tzung Chien
In speaker recognition, the mismatch between the enrollment and test utterances due to noise with different signal-to-noise ratios (SNRs) is a great challenge. Based on the observation that noise-level variability causes the i-vectors to form heterogeneous clusters, this paper proposes using an SNR-aware deep neural network (DNN) to guide the training of PLDA mixture models. Specifically, given an i-vector, the SNR posterior probabilities produced by the DNN are used as the posteriors of indicator variables of the mixture model. As a result, the proposed model provides a more reasonable soft division of the i-vector space compared to the conventional mixture of PLDA. During verification, given a test trial, the marginal likelihoods from individual PLDA models are linearly combined by the posterior probabilities of SNR levels computed by the DNN. Experimental results for SNR mismatch tasks based on NIST 2012 SRE suggest that the proposed model is more effective than PLDA and conventional mixture of PLDA for handling heterogeneous corpora.
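The score combination in the verification step can be written compactly: the DNN's SNR posteriors weight the marginal likelihoods of the individual PLDA components. The sketch below performs this combination stably in the log domain, taking the per-component PLDA log-likelihoods as given (computing them is outside the scope of this snippet).

```python
import numpy as np

def mixture_plda_score(snr_posteriors: np.ndarray, plda_loglikes: np.ndarray) -> float:
    """Combine per-SNR-component PLDA scores using DNN SNR posteriors.

    snr_posteriors: (K,) posterior probabilities of the K SNR levels for the trial,
                    as produced by the SNR-aware DNN.
    plda_loglikes:  (K,) marginal log-likelihoods of the trial under each PLDA component.
    Returns log sum_k posterior_k * exp(loglike_k), computed stably via log-sum-exp.
    """
    log_terms = np.log(snr_posteriors + 1e-12) + plda_loglikes
    m = log_terms.max()
    return float(m + np.log(np.exp(log_terms - m).sum()))

# Example with three SNR levels (numbers are purely illustrative).
print(mixture_plda_score(np.array([0.7, 0.2, 0.1]), np.array([-10.3, -12.1, -15.0])))
```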
{"title":"Deep neural network driven mixture of PLDA for robust i-vector speaker verification","authors":"N. Li, M. Mak, Jen-Tzung Chien","doi":"10.1109/SLT.2016.7846263","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846263","url":null,"abstract":"In speaker recognition, the mismatch between the enrollment and test utterances due to noise with different signal-to-noise ratios (SNRs) is a great challenge. Based on the observation that noise-level variability causes the i-vectors to form heterogeneous clusters, this paper proposes using an SNR-aware deep neural network (DNN) to guide the training of PLDA mixture models. Specifically, given an i-vector, the SNR posterior probabilities produced by the DNN are used as the posteriors of indicator variables of the mixture model. As a result, the proposed model provides a more reasonable soft division of the i-vector space compared to the conventional mixture of PLDA. During verification, given a test trial, the marginal likelihoods from individual PLDA models are linearly combined by the posterior probabilities of SNR levels computed by the DNN. Experimental results for SNR mismatch tasks based on NIST 2012 SRE suggest that the proposed model is more effective than PLDA and conventional mixture of PLDA for handling heterogeneous corpora.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"516 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123408270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The fifth dialog state tracking challenge
DOI: 10.1109/SLT.2016.7846311
Seokhwan Kim, L. F. D’Haro, Rafael E. Banchs, J. Williams, Matthew Henderson, Koichiro Yoshino
Dialog state tracking - the process of updating the dialog state after each interaction with the user - is a key component of most dialog systems. Following a similar scheme to the fourth dialog state tracking challenge, this edition again focused on human-human dialogs, but introduced the task of cross-lingual adaptation of trackers. The challenge received a total of 32 entries from 9 research groups. In addition, several pilot track evaluations were also proposed, receiving a total of 16 entries from 4 groups. In both cases, the results show that most of the groups were able to outperform the provided baselines for each task.
{"title":"The fifth dialog state tracking challenge","authors":"Seokhwan Kim, L. F. D’Haro, Rafael E. Banchs, J. Williams, Matthew Henderson, Koichiro Yoshino","doi":"10.1109/SLT.2016.7846311","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846311","url":null,"abstract":"Dialog state tracking - the process of updating the dialog state after each interaction with the user - is a key component of most dialog systems. Following a similar scheme to the fourth dialog state tracking challenge, this edition again focused on human-human dialogs, but introduced the task of cross-lingual adaptation of trackers. The challenge received a total of 32 entries from 9 research groups. In addition, several pilot track evaluations were also proposed receiving a total of 16 entries from 4 groups. In both cases, the results show that most of the groups were able to outperform the provided baselines for each task.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129363724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}