
Latest Publications from the IberSPEECH Conference

In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-45
I. Viñals, Pablo Gimeno, A. Ortega, A. Miguel, Eduardo Lleida Solano
This paper addresses domain mismatch scenarios in the diarization task. This research has been carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSpeech 2018. This evaluation seeks to improve diarization in broadcast corpora, which are known to contain multiple unknown speakers. These speakers contribute across different scenarios, genres, media and languages. The evaluation offers two different conditions: a closed condition that restricts the resources available to train and develop diarization systems, and an open condition without restrictions, intended to assess the latest improvements in the state of the art. Our proposal is centered on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a Fully Bayesian PLDA, a generative model with latent variables as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized in terms of the number of hypothesized speakers to compensate for different modeling capabilities.
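The ELBO-based selection of the number of speakers lends itself to a compact illustration. Below is a minimal Python sketch of choosing a clustering hypothesis by penalized ELBO, assuming a hypothetical `fit_plda` helper that clusters i-vectors for a fixed speaker count and exposes `elbo()` and `labels()`; it is not the ViVoLab implementation.

```python
# Minimal sketch of penalized-ELBO model selection for the number of
# speakers. fit_plda, its elbo()/labels() interface and the penalty
# weight are hypothetical placeholders, not the authors' code.
import numpy as np

def select_num_speakers(ivectors, fit_plda, max_speakers=10, penalty=1.0):
    """Pick the clustering hypothesis with the best penalized ELBO.

    fit_plda(ivectors, n_speakers) is assumed to return a fitted
    Fully Bayesian PLDA model exposing .elbo() and .labels().
    """
    best_score, best_labels = -np.inf, None
    for n in range(1, max_speakers + 1):
        model = fit_plda(ivectors, n)
        score = model.elbo() - penalty * n  # penalize extra speakers
        if score > best_score:
            best_score, best_labels = score, model.labels()
    return best_labels
```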
Citations: 8
ODESSA at Albayzin Speaker Diarization Challenge 2018
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-43
Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans
This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018. The challenge addresses the diarization of TV shows. This work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can easily be applied to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization performance through system combination. For systems with a common temporal resolution, fusion is performed at the segment level during clustering. When the systems under fusion produce segmentations with an arbitrary resolution, they are combined at the solution level. Both approaches to fusion are shown to improve diarization performance.
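As a rough illustration of segment-level fusion, the sketch below takes a majority vote over per-segment labels from systems sharing a common temporal resolution. It assumes the speaker labels of the different systems have already been mapped to a common label space, which in practice is itself a non-trivial alignment step; this is not the ODESSA fusion code.

```python
# Minimal sketch of segment-level fusion for diarization systems that
# share a common temporal resolution: each segment takes the majority
# label across systems. Illustrative simplification only.
from collections import Counter

def fuse_segment_labels(system_outputs):
    """system_outputs: list of label sequences, one per system,
    all of the same length (one label per segment)."""
    fused = []
    for labels in zip(*system_outputs):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

print(fuse_segment_labels([["A", "A", "B"], ["A", "B", "B"], ["A", "A", "B"]]))
# -> ['A', 'A', 'B']
```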
Citations: 7
LSTM based voice conversion for laryngectomees
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-26
Luis Serrano, David Tavarez, X. Sarasola, Sneha Raman, I. Saratxaga, E. Navas, I. Hernáez
This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and by the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).
Citations: 9
On the use of Phone-based Embeddings for Language Recognition
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-12
Christian Salamea, R. Córdoba, L. F. D’Haro, Rubén San-Segundo-Hernández, J. Ferreiros
Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as entries in a classical i-vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare systems. As a baseline we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline using the Skip-Gram model, and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-vector system provides up to 34.1% improvement over the acoustic system alone.
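Training Skip-Gram embeddings over phone-gram sequences can be illustrated with an off-the-shelf trainer. The sketch below uses gensim's Word2Vec as an assumed stand-in (the paper does not name its toolkit); the phone-gram tokens and dimensions are toy values.

```python
# Minimal sketch of training Skip-Gram embeddings over phone-gram
# sequences with gensim (an assumed stand-in toolkit). Each "sentence"
# is the phone-gram sequence decoded from one utterance by the ASR.
from gensim.models import Word2Vec

phonegram_sequences = [
    ["a_b", "b_c", "c_a"],          # toy utterance 1
    ["b_c", "c_a", "a_b", "b_c"],   # toy utterance 2
]

model = Word2Vec(
    sentences=phonegram_sequences,
    vector_size=100,  # embedding dimensionality
    window=3,         # neighbouring phone-grams used as context
    min_count=1,
    sg=1,             # 1 = Skip-Gram (0 would be CBOW)
)

print(model.wv["a_b"][:5])  # embedding for one phone-gram
```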
Citations: 2
EML Submission to Albayzin 2018 Speaker Diarization Challenge
Pub Date: 2018-11-21 DOI: 10.21437/iberspeech.2018-44
O. Ghahabi, V. Fischer
Speaker diarization, determining who is speaking when, is one of the most challenging tasks in speaker recognition, as usually no prior information is available about the identity or the number of speakers in an audio recording. The task becomes more challenging when there is noise or music in the background and the speakers change more frequently, as usually happens in broadcast news conversations. In this paper, we present the EML speaker diarization system as a submission to the recent Albayzin Evaluation challenge. The EML system uses a real-time robust algorithm that makes decisions about the identity of the speakers approximately every 2 seconds. Experimental results on about 16 hours of the development data provided in the challenge show a reasonable accuracy of the system with a very low computational cost.
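To make the every-2-seconds decision scheme concrete, here is a minimal sketch of an online diarization loop: each 2-second chunk is embedded and either assigned to the closest known speaker centroid or opens a new speaker. The `embed` function and the similarity threshold are hypothetical placeholders; this is not the EML implementation.

```python
# Minimal sketch of an online diarization loop that emits one speaker
# decision per 2-second chunk. embed() is a hypothetical extractor
# assumed to return unit-norm embeddings.
import numpy as np

def online_diarize(frames, embed, sr=16000, step_s=2.0, threshold=0.6):
    """frames: 1-D audio signal; embed(chunk) -> unit-norm embedding."""
    centroids, labels = [], []
    step = int(step_s * sr)
    for start in range(0, len(frames) - step + 1, step):
        e = embed(frames[start:start + step])
        sims = [float(np.dot(e, c)) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))             # known voice: update centroid
            updated = centroids[k] + e
            centroids[k] = updated / np.linalg.norm(updated)
        else:
            k = len(centroids)                   # unseen voice: new speaker
            centroids.append(e)
        labels.append(k)
    return labels
```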
Citations: 4
Emotion Detection from Speech and Text
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-15
Mikel de Velasco, R. Justo, J. Antón, Mikel Carrilero, M. Inés Torres
This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R) and by the European Commission H2020 SC1-PM15 programme under RIA grant 769872.
Citations: 11
RESTORE Project: REpair, STOrage and REhabilitation of speech
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-34
I. Hernáez, E. Navas, J. Martín, J. Suárez
This project has been funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-R).
Citations: 0
Performance evaluation of front- and back-end techniques for ASV spoofing detection systems based on deep features
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-10
A. Alanís, A. Peinado, José Andrés González López, A. Gómez
As Automatic Speaker Verification (ASV) becomes more popular, so do the ways impostors can gain illegal access to speech-based biometric systems. For instance, impostors can use Text-to-Speech (TTS) and Voice Conversion (VC) techniques to generate speech acoustics resembling the voice of a genuine user and hence gain fraudulent access to the system. To prevent this, a number of anti-spoofing countermeasures have been developed for detecting these high-technology attacks. However, the detection of previously unforeseen spoofing attacks remains challenging. To address this issue, in this work we perform an extensive empirical investigation of the speech features and back-end classifiers providing the best overall performance for an anti-spoofing system based on a deep learning framework. In this architecture, a deep neural network is used to extract a single identity spoofing vector per utterance from the speech features. Then, the extracted vectors are passed to a classifier in order to make the final detection decision. Experimental evaluation is carried out on the standard ASVspoof 2015 data corpus. The results show that classical FBANK features and Linear Discriminant Analysis (LDA) obtain the best performance for the proposed system.
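The back-end stage can be sketched with scikit-learn: an LDA classifier fitted on per-utterance deep feature vectors and used for the final genuine/spoofed decision. The vectors below are random placeholders standing in for the DNN-extracted identity spoofing vectors.

```python
# Minimal sketch of the LDA back-end over per-utterance deep feature
# vectors ("identity spoofing vectors"). The DNN extractor is out of
# scope; random vectors stand in for its output.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))    # 200 utterance-level vectors
y_train = rng.integers(0, 2, size=200)  # 0 = genuine, 1 = spoofed

backend = LinearDiscriminantAnalysis()
backend.fit(X_train, y_train)

X_test = rng.normal(size=(5, 64))
print(backend.predict(X_test))          # final detection decisions
```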
Citations: 12
Wide Residual Networks 1D for Automatic Text Punctuation
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-62
Jorge Llombart, A. Miguel, A. Ortega, Eduardo Lleida Solano
Documentation and analysis of multimedia resources usually require a large pipeline with many stages. It is common to obtain texts without punctuation at some point, although later steps may need accurate punctuation, such as those related to natural language processing. This paper is focused on the task of recovering pause punctuation for a text without prosodic or acoustic information. We propose the use of Wide Residual Networks to predict which words should be followed by a comma or full stop in a text whose punctuation has been removed. Wide Residual Networks are a well-known technique in image processing, but they are not commonly used in other areas such as speech or natural language processing. We propose the use of Wide Residual Networks because they show great stability and the ability to model long and short contextual dependencies in deep structures. Unlike in image processing, we use 1-dimensional convolutions because in text processing we only focus on the temporal dimension. Moreover, this architecture allows us to work with past and future context. This paper compares this architecture with the Long Short-Term Memory cells commonly used in this task, and also combines the two architectures to obtain better results than either of them separately.
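A 1-dimensional wide residual block is easy to sketch in PyTorch. The block below follows the usual pre-activation pattern (BatchNorm, ReLU, Conv1d) with a widened inner channel count and an identity shortcut along the time axis; the layer sizes are illustrative, not the paper's configuration.

```python
# Minimal sketch of a 1-D residual block in the Wide Residual Network
# style, applied along the temporal axis of a word sequence.
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    def __init__(self, channels, width_factor=4, kernel_size=3):
        super().__init__()
        wide = channels * width_factor   # "wide" inner representation
        pad = kernel_size // 2           # keep the temporal length
        self.body = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, wide, kernel_size, padding=pad),
            nn.BatchNorm1d(wide), nn.ReLU(),
            nn.Conv1d(wide, channels, kernel_size, padding=pad),
        )

    def forward(self, x):                # x: (batch, channels, time)
        return x + self.body(x)          # residual connection

x = torch.randn(2, 64, 120)              # 2 sequences, 120 word steps
print(ResidualBlock1D(64)(x).shape)      # torch.Size([2, 64, 120])
```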
Citations: 4
Restricted Boltzmann Machine Vectors for Speaker Clustering
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-3
Muhammad Umair Ahmed Khan, Pooyan Safari, J. Hernando
Restricted Boltzmann Machines (RBMs) have been used both in the front-end and back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. In this work, we perform traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% compared to i-vectors for the single linkage algorithm.
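Bottom-up AHC with cosine scoring can be reproduced with SciPy, as in the minimal sketch below; the random vectors stand in for RBM vectors, and the cluster count is an arbitrary stopping point rather than the paper's tuned criterion.

```python
# Minimal sketch of bottom-up agglomerative clustering of speaker
# vectors with cosine scoring. Random placeholders stand in for the
# RBM vectors described above.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 50))  # 30 utterance-level speaker vectors

# 'average' linkage with cosine distance; 'single' linkage is the other
# variant compared in the paper.
Z = linkage(vectors, method="average", metric="cosine")
labels = fcluster(Z, t=4, criterion="maxclust")  # stop at 4 clusters
print(labels)
```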
Citations: 5