JHU Diarization System Description
Zili Huang, Leibny Paola García-Perera, J. Villalba, Daniel Povey, N. Dehak
IberSPEECH Conference, 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-49
Citations: 5
Abstract
We present the JHU system for the IberSPEECH-RTVE Speaker Diarization Evaluation. This evaluation combines Spanish-language speech and broadcast audio in the same recordings, conditions under which our system had not been tested before. To tackle this problem, the pipeline of our general system, developed entirely in Kaldi, includes acoustic feature extraction, speech activity detection (SAD), an embedding extractor, PLDA scoring, and a clustering stage. This pipeline was used for both the open and the closed conditions (described in the evaluation plan). All the proposed solutions use wide-band data (16 kHz) and MFCCs as their input. For the closed condition, the system trains a DNN SAD using the Albayzin 2016 data. Due to the small amount of data available, i-vector embedding extraction was the only approach explored for this condition. The PLDA is trained on Albayzin data, and Agglomerative Hierarchical Clustering (AHC) then produces the speaker segmentation. The open condition employs the DNN SAD obtained in the closed condition. Four types of embeddings were extracted: x-vector-basic, x-vector-factored, i-vector-basic, and BNF-i-vector. The x-vector-basic is a TDNN trained on augmented VoxCeleb1 and VoxCeleb2. The x-vector-factored is a factored TDNN (TDNN-F) trained on SRE12-micphn, MX6-micphn, VoxCeleb, and SITW-dev-core. The i-vector-basic was trained on VoxCeleb1 and VoxCeleb2 data (no augmentation). The BNF-i-vector is a bottleneck-feature (BNF) posterior i-vector trained on the same data as x-vector-factored. The PLDA for this scenario is also trained on the Albayzin 2016 data. The four systems were fused at the score level, and AHC again computed the final speaker segmentation. We tested our systems on the Albayzin 2018 dev2 data and observed that the SAD is important for improving the results. Moreover, x-vectors performed better than i-vectors, as observed in previous experiments.
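To make the clustering stage concrete, here is a minimal sketch of the embedding-scoring-AHC step, assuming cosine similarity as a stand-in for the PLDA scores the system actually uses; the synthetic embeddings, average linkage, and stopping threshold below are illustrative placeholders, not the authors' configuration.

```python
# Minimal sketch of the embedding -> scoring -> AHC stage.
# Cosine similarity stands in for PLDA scoring; the threshold and the
# synthetic embeddings are illustrative placeholders, not the paper's setup.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_segments(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign a speaker label to each segment embedding via AHC.

    embeddings: (num_segments, dim) array of i-vectors or x-vectors.
    threshold:  AHC stopping distance; in practice tuned on dev data.
    """
    # Length-normalize so dot products equal cosine similarities.
    emb = embeddings / np.maximum(
        np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-8)

    # Pairwise similarity -> distance (pairwise PLDA log-likelihood
    # ratios would replace this in the full system).
    dist = 1.0 - emb @ emb.T
    np.fill_diagonal(dist, 0.0)

    # Average-linkage AHC, cut at the tuned threshold.
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")

# Toy usage: 10 segments alternating between two synthetic "speakers".
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 128))
segs = np.stack([centers[i % 2] + 0.1 * rng.normal(size=128) for i in range(10)])
print(cluster_segments(segs))  # e.g., [1 2 1 2 1 2 1 2 1 2]
```

In the full system, the distance matrix would come from PLDA scoring of the extracted embeddings, with the stopping threshold tuned on held-out data such as the Albayzin 2018 dev2 set.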
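Similarly, a hedged sketch of the score-level fusion of the four embedding systems in the open condition: the abstract does not specify the fusion weights or calibration, so the equal weights and z-normalization below are assumptions.

```python
# Hedged sketch of score-level fusion for the open condition: combine the
# four per-system segment-vs-segment score matrices before a single AHC pass.
# Equal weights and z-normalization are assumptions; the abstract does not
# specify the actual fusion scheme.
import numpy as np

def fuse_scores(score_matrices, weights=None):
    """Weighted sum of z-normalized per-system score matrices."""
    if weights is None:
        weights = [1.0 / len(score_matrices)] * len(score_matrices)
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for w, s in zip(weights, score_matrices):
        # Normalize each system's scores to a common scale before weighting.
        fused += w * (s - s.mean()) / (s.std() + 1e-8)
    return fused
```

The fused score matrix, converted to distances, would then drive the same AHC step sketched above to produce the final speaker segmentation.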