ODESSA at Albayzin Speaker Diarization Challenge 2018

IberSPEECH Conference. Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-43

Jose Patino, H. Delgado, Ruiqing Yin, H. Bredin, C. Barras, N. Evans


Citations: 7

Abstract

This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018. The challenge addresses the diarization of TV shows. This work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization performance through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with an arbitrary resolution, they are combined at solution level. Both approaches to fusion are shown to improve diarization performance.
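The abstract does not say how segment-level fusion is carried out, so the following is only a minimal sketch of one plausible reading. It assumes two systems that embed the same segments, fuses them by a weighted average of their cosine-similarity matrices, and then applies a simple average-linkage agglomerative clustering with a stopping threshold. The function names, the fusion weight, the threshold and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for one system's segment embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def fuse(sim_a, sim_b, weight=0.5):
    """Weighted average of two similarity matrices over the same segments.

    Assumption: both systems share a temporal resolution, so entry (i, j)
    refers to the same pair of segments in both matrices.
    """
    n = len(sim_a)
    return [
        [weight * sim_a[i][j] + (1.0 - weight) * sim_b[i][j] for j in range(n)]
        for i in range(n)
    ]


def cluster(sim, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Merges the most similar pair of clusters until the best average
    similarity falls below `threshold`. Returns one label per segment.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if score > best:
                    best, pair = score, (a, b)
        if best < threshold:
            break
        a, b = pair
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy data: four segments, two hypothetical embedding systems of
# different dimensionality (segments 0-1 and 2-3 are meant to be
# two different speakers).
system_a = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
system_b = [[1.0, 0.0, 0.1], [0.8, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.9, 0.1]]

fused = fuse(similarity_matrix(system_a), similarity_matrix(system_b))
labels = cluster(fused, threshold=0.5)
print(labels)
```

Fusing at the similarity level lets the two systems use embeddings of different dimensionality, since only the segment-by-segment scores need to line up. It does not cover the solution-level case mentioned above, where the systems produce segmentations with different boundaries.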

