Latest publications from the IberSPEECH Conference

Bottleneck and Embedding Representation of Speech for DNN-based Language and Speaker Recognition
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-36
Alicia Lozano-Diez, J. González-Rodríguez, J. Gonzalez-Dominguez
In this manuscript, we summarize the findings presented in Alicia Lozano Diez's Ph.D. Thesis, defended on the 22nd of June, 2018 at Universidad Autonoma de Madrid (Spain). In particular, this Ph.D. Thesis explores different approaches to the tasks of language and speaker recognition, focusing on systems where deep neural networks (DNNs) become part of traditional pipelines, replacing some stages or the whole system itself. First, we present a DNN as a classifier for the task of language recognition. Second, we analyze the use of DNNs for feature extraction at frame level, the so-called bottleneck features, for both language and speaker recognition. Finally, an utterance-level representation of the speech segments learned by the DNN (known as an embedding) is described and presented for the task of language recognition. All these approaches provide alternatives to classical language and speaker recognition systems based on i-vectors (Total Variability modeling) over acoustic features (MFCCs, for instance), and they usually yield better results in terms of performance. The networks are trained with stochastic gradient descent to minimize the negative log-likelihood.
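As a rough illustration of the bottleneck idea, the following minimal PyTorch sketch (not the thesis code; the layer sizes, the 39-dimensional MFCC input with 21 frames of context, and the 8-language output are illustrative assumptions) shows a frame-level classifier trained with stochastic gradient descent on the negative log-likelihood, whose narrow hidden layer is tapped as a bottleneck feature extractor:

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, feat_dim=39, context=21, bottleneck_dim=80, n_classes=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim * context, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck_dim),   # narrow "bottleneck" layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):
        bnf = self.encoder(x)              # frame-level bottleneck features
        return self.classifier(bnf), bnf

model = BottleneckDNN()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
frames = torch.randn(256, 39 * 21)         # a batch of stacked, spliced frames
labels = torch.randint(0, 8, (256,))       # one language label per frame
logits, bnf = model(frames)
loss = nn.functional.cross_entropy(logits, labels)  # negative log-likelihood
opt.zero_grad(); loss.backward(); opt.step()
# An utterance-level "embedding" could be obtained by pooling bnf over time,
# e.g. bnf.mean(dim=0) over all frames of one utterance.
```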
Citations: 2
TransDic, a public domain tool for the generation of phonetic dictionaries in standard and dialectal Spanish and Catalan
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-61
J. Garrido, Marta Codina, K. Fodge
This paper presents TransDic, a freely distributed tool for the phonetic transcription of word lists in Spanish and Catalan which allows the generation of phonetic transcription variants, a feature that can be useful for some technological applications, such as speech recognition. It supports transcription not only in standard Spanish and Catalan, but also in several dialects of these two languages spoken in Spain. Its general structure, input, output and main functionalities are presented, and the procedure followed to define and implement the transcription rules in the tool is described. Finally, the results of an evaluation carried out for both languages are presented, which show that TransDic correctly performs the transcription tasks it was developed for.
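As a toy illustration of rule-based transcription with variants (TransDic's actual rule format and rule set are not given in the abstract; the rules below are invented for the sketch), multi-output rules naturally yield pronunciation variants as a Cartesian product over per-segment alternatives:

```python
import itertools

RULES = [                    # illustrative Spanish-like rules, not TransDic's
    ("ll", ["L", "jj"]),     # yeismo: <ll> as palatal lateral or fricative
    ("ch", ["tS"]),
    ("c",  ["k", "T"]),      # context-free simplification for this sketch
]

def transcribe(word):
    """Apply rules left to right; branch wherever a rule has several outputs."""
    segments, i = [], 0
    while i < len(word):
        for pattern, outputs in RULES:
            if word.startswith(pattern, i):
                segments.append(outputs)
                i += len(pattern)
                break
        else:                              # no rule matched: identity mapping
            segments.append([word[i]])
            i += 1
    return ["".join(v) for v in itertools.product(*segments)]

print(transcribe("calle"))  # ['kaLe', 'kajje', 'TaLe', 'Tajje']
```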
Citations: 2
AUDIAS-CEU: A Language-independent approach for the Query-by-Example Spoken Term Detection task of the Search on Speech ALBAYZIN 2018 evaluation
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-51
Maria Cabello, D. Toledano, Javier Tejedor
Query-by-Example Spoken Term Detection is the task of detecting query occurrences within speech data (henceforth utterances). Our submission is based on a language-independent template matching approach. First, queries and utterances are represented as phonetic posteriorgrams computed for the English language with the phoneme decoder developed by the Brno University of Technology. Next, the Subsequence Dynamic Time Warping algorithm with a modified Pearson correlation coefficient as the cost measure is employed to hypothesize detections. Results on the development data showed an ATWV=0.1774 on MAVIR data and an ATWV=0.0365 on RTVE data.
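A minimal sketch of the search step, assuming the plain Pearson correlation as a stand-in for the paper's modified coefficient (the exact modification is not given in the abstract), with a free start and end on the utterance axis as in subsequence DTW:

```python
import numpy as np

def pearson_cost(q, u):
    """Local cost between a query frame q and an utterance frame u."""
    qc, uc = q - q.mean(), u - u.mean()
    denom = np.linalg.norm(qc) * np.linalg.norm(uc) + 1e-10
    return 1.0 - (qc @ uc) / denom

def subsequence_dtw(query, utterance):
    """query: (N, D), utterance: (M, D) posteriorgrams.
    Returns the (roughly length-normalized) distance of the best match."""
    N, M = len(query), len(utterance)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, :] = 0.0                       # free start anywhere in the utterance
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            c = pearson_cost(query[i - 1], utterance[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = np.argmin(D[N, 1:]) + 1       # free end: best over all end frames
    return D[N, end] / N                # lower = better match

query = np.random.rand(40, 46)          # e.g. 46 phoneme posteriors per frame
utterance = np.random.rand(500, 46)
print(subsequence_dtw(query, utterance))
```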
Citations: 0
End-to-End Speech Translation with the Transformer
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-13
Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, M. Costa-jussà
Speech Translation has traditionally been addressed by concatenating two tasks: Speech Recognition and Machine Translation. The main drawback of this approach is that errors accumulate across the two systems. Recently, neural approaches to Speech Recognition and Machine Translation have made it possible to face the task with an End-to-End Speech Translation architecture. In this paper, we propose to use the Transformer architecture, which relies solely on attention mechanisms, to build the End-to-End Speech Translation system. As a contrastive architecture, we use the same Transformer to build the Speech Recognition and Machine Translation systems and perform Speech Translation through the concatenation of systems. Results on a Spanish-to-English standard task show that the end-to-end architecture is able to outperform the concatenated systems by half a BLEU point.
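A minimal sketch of such an end-to-end model using PyTorch's stock nn.Transformer; the dimensions, vocabulary size and simple linear frame projection are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SpeechTranslationTransformer(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, vocab=8000):
        super().__init__()
        self.src_proj = nn.Linear(feat_dim, d_model)  # acoustic frames -> model dim
        self.tgt_emb = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, frames, tgt_tokens):
        # Causal mask so each target position only attends to its past
        mask = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        h = self.transformer(self.src_proj(frames), self.tgt_emb(tgt_tokens),
                             tgt_mask=mask)
        return self.out(h)   # per-position logits over the target vocabulary

model = SpeechTranslationTransformer()
logits = model(torch.randn(2, 300, 80), torch.randint(0, 8000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 8000])
```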
Citations: 57
The GTM-UVIGO System for Albayzin 2018 Speech-to-Text Evaluation
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-58
Laura Docío Fernández, C. García-Mateo
This paper describes the Speech-to-Text system developed by the Multimedia Technologies Group (GTM) of the atlanTTic research center at the University of Vigo for the Albayzin Speech-to-Text Challenge (S2T) organized within the IberSPEECH 2018 conference. The large-vocabulary automatic speech recognition system is built using the Kaldi toolkit. It uses a hybrid Deep Neural Network - Hidden Markov Model (DNN-HMM) for acoustic modeling, and rescores the trigram-based word lattices obtained in a first decoding stage with either a four-gram language model or a language model based on a recurrent neural network. The system was evaluated only on the open-set training condition.
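The second-pass idea can be illustrated with a simplified n-best re-ranking sketch (the actual system rescores Kaldi word lattices, not n-best lists; the LM callable and weight below are hypothetical):

```python
def rescore(nbest, stronger_lm, lm_weight=0.8):
    """nbest: list of (words, acoustic_score, first_pass_lm_score) tuples,
    all scores as log-probabilities. Returns the best word sequence after
    replacing the first-pass LM score with the stronger LM's score."""
    rescored = [(am + lm_weight * stronger_lm(words), words)
                for words, am, _old_lm in nbest]
    return max(rescored)[1]

# Toy usage with a hypothetical LM callable returning a log-probability:
toy_lm = lambda words: -2.0 * len(words)
print(rescore([("hola mundo".split(), -50.0, -8.0),
               ("ola a mundo".split(), -49.5, -12.0)], toy_lm))
```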
Citations: 2
Towards an automatic evaluation of the prosody of people with Down syndrome
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-24
Mario Corrales-Astorgano, P. Martínez-Castilla, David Escudero Mancebo, L. Aguilar, César González Ferreras, Valentín Cardeñoso-Payo
{"title":"Towards an automatic evaluation of the prosody of people with Down syndrome","authors":"Mario Corrales-Astorgano, P. Martínez-Castilla, David Escudero Mancebo, L. Aguilar, César González Ferreras, Valentín Cardeñoso-Payo","doi":"10.21437/IberSPEECH.2018-24","DOIUrl":"https://doi.org/10.21437/IberSPEECH.2018-24","url":null,"abstract":"","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116689944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Converted Mel-Cepstral Coefficients for Gender Variability Reduction in Query-by-Example Spoken Document Retrieval
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-18
Paula Lopez-Otero, Laura Docío Fernández
{"title":"Converted Mel-Cepstral Coefficients for Gender Variability Reduction in Query-by-Example Spoken Document Retrieval","authors":"Paula Lopez-Otero, Laura Docío Fernández","doi":"10.21437/IberSPEECH.2018-18","DOIUrl":"https://doi.org/10.21437/IberSPEECH.2018-18","url":null,"abstract":"","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131205934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The observation likelihood of silence: analysis and prospects for VAD applications
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-11
I. Odriozola, I. Hernáez, E. Navas, Luis Serrano, Jon Sánchez
This work has been partially supported by the EU (FEDER) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/FEDER, UE) and by the Basque Government under grant KK-2017/00043 (BerbaOla).
Citations: 0
CENATAV Voice-Group Systems for Albayzin 2018 Speaker Diarization Evaluation Campaign
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-47
Edward L. Campbell, Gabriel Hernández, J. Lara
Usually, the environment in which a voice signal is recorded is not ideal, and, in order to improve the representation of the speaker characteristic space, it is necessary to use robust algorithms that make the representation more stable in the presence of noise. A diarization system that focuses on the use of robust feature extraction techniques is proposed in this paper. The presented features (such as Mean Hilbert Envelope Coefficients, Medium Duration Modulation Coefficients and Power Normalization Cepstral Coefficients) were not used in other Albayzin challenges. These robust techniques have a common characteristic: the use of a Gammatone filter bank for dividing the voice signal into sub-bands, as an alternative to the classical triangular filter bank used in Mel Frequency Cepstral Coefficients. The experimental results show a more stable Diarization Error Rate with robust features than with classic features.
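The shared ingredient, a gammatone filter bank on an ERB-spaced frequency axis, can be sketched as follows (the filter count, order and bandwidth scaling follow common defaults and are assumptions, not the exact parameters of the evaluated features):

```python
import numpy as np

def erb_space(low, high, n):
    """n center frequencies equally spaced on the ERB-rate scale."""
    c = 9.26449 * 24.7                  # EarQ * minBW (Glasberg & Moore)
    return np.exp(np.linspace(np.log(low + c), np.log(high + c), n)) - c

def gammatone_ir(fc, fs, order=4, duration=0.025):
    """Impulse response of a gammatone filter centered at fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * (24.7 + fc / 9.26449)   # bandwidth from the ERB at fc
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatone_log_energies(signal, fs, n_filters=40):
    """Log energy of the signal in each gammatone sub-band."""
    feats = []
    for fc in erb_space(100.0, 0.45 * fs, n_filters):
        y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        feats.append(np.log(np.mean(y ** 2) + 1e-10))
    return np.array(feats)

fs = 16000
print(gammatone_log_energies(np.random.randn(fs), fs).shape)  # (40,)
```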
Citations: 3
GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-52
Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, Germán Bordel
This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as the frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has also been tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of the development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set; in this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.
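A minimal sketch of query-by-query score normalization, assuming qnorm here means zero-mean, unit-variance scaling per query (the exact variant used by GTTS-EHU may differ):

```python
import numpy as np

def qnorm(scores_by_query):
    """scores_by_query: dict query_id -> array of raw detection scores.
    Returns the same mapping with each query's scores standardized by that
    query's own mean and standard deviation, so a single decision threshold
    becomes comparable across queries."""
    normalized = {}
    for qid, s in scores_by_query.items():
        s = np.asarray(s, dtype=float)
        normalized[qid] = (s - s.mean()) / (s.std() + 1e-10)
    return normalized

raw = {"q1": [0.9, 0.2, 0.1], "q2": [12.0, 11.5, 3.0]}
print(qnorm(raw))
```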
Citations: 4