I. Viñals, Pablo Gimeno, A. Ortega, A. Miguel, Eduardo Lleida Solano
IberSPEECH Conference, published 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-45 (https://doi.org/10.21437/IBERSPEECH.2018-45)
Citation count: 8
In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge
This paper addresses domain mismatch in the diarization task. The research was carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSpeech 2018. The evaluation seeks to improve diarization of broadcast corpora, which are known to contain multiple unknown speakers who contribute across different scenarios, genres, media, and languages. The evaluation offers two conditions: a closed one, with restrictions on the resources used to train and develop diarization systems, and an open one, without restrictions, to check the latest improvements in the state of the art. Our proposal is centered on the closed condition, dealing especially with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment contains a single speaker's intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a Fully Bayesian PLDA, a generative model whose latent variables act as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized in terms of the number of hypothesized speakers to compensate for differing modeling capabilities.
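
The final selection step described above can be illustrated with a minimal sketch. It assumes that an ELBO value has already been computed for each hypothesized number of speakers, and it uses a linear penalty per speaker; the abstract does not state the exact penalty, so the function name `select_num_speakers`, the parameter `penalty_per_speaker`, and the ELBO values are all hypothetical and serve only as illustration.

```python
from typing import Dict, Tuple


def select_num_speakers(elbo_by_count: Dict[int, float],
                        penalty_per_speaker: float) -> Tuple[int, float]:
    """Return (speaker_count, penalized_score) with the highest penalized ELBO.

    ASSUMPTION: a linear penalty, score = ELBO - penalty_per_speaker * count.
    The source only says the ELBO is penalized in terms of the hypothesized
    speakers; the exact form used by the authors is not given.
    """
    if not elbo_by_count:
        raise ValueError("need at least one hypothesis")
    # Penalize each hypothesis in proportion to its number of speakers.
    scored = {
        count: elbo - penalty_per_speaker * count
        for count, elbo in elbo_by_count.items()
    }
    # Keep the hypothesis whose penalized score is largest.
    best = max(scored, key=scored.get)
    return best, scored[best]


# Made-up ELBO values, one per hypothesized speaker count (illustration only).
hypothetical_elbos = {1: -1250.0, 2: -1100.0, 3: -1080.0, 4: -1075.0}
best_count, best_score = select_num_speakers(hypothetical_elbos,
                                             penalty_per_speaker=25.0)
```

Without a penalty, the hypothesis with the most speakers would win in this toy example, since its raw ELBO is the highest; the penalty is what lets a smaller model be preferred when the gain from extra speakers is marginal.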