Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

arXiv - EE - Audio and Speech Processing · arXiv:2409.10791 · Published 2024-09-16
Abstract
Iterative self-training, or iterative pseudo-labeling (IPL), in which an improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach for enhancing the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
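The loop the abstract describes can be sketched in a few lines. The version below is a toy illustration, not the paper's implementation: a deterministically initialized k-means stands in for the clustering algorithm, a centroid-pull update stands in for retraining an encoder on the pseudo-labels, and the initial embeddings stand in for i-vectors. The function name `ipl_pseudo_labels` and all parameters are illustrative assumptions.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic k-means init: greedily pick mutually distant points."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])
    return np.asarray(centers)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means returning cluster assignments (pseudo-labels)."""
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def ipl_pseudo_labels(X, k, rounds=3):
    """One IPL loop: cluster embeddings -> pseudo-labels -> refine
    embeddings -> re-cluster. The refinement here (pulling each
    embedding toward its cluster centroid) is a crude stand-in for
    training an encoder to predict the pseudo-labels."""
    emb = X.copy()                 # round 0: initial-model embeddings (i-vector stand-in)
    labels = kmeans(emb, k)
    for _ in range(rounds):
        centroids = np.stack([emb[labels == j].mean(0) for j in range(k)])
        emb = 0.5 * emb + 0.5 * centroids[labels]   # "retrained" embeddings
        labels = kmeans(emb, k)                     # pseudo-labels for the next round
    return labels

# Toy demo: two well-separated "speakers" in 2-D embedding space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
labels = ipl_pseudo_labels(X, k=2, rounds=2)
```

The key property of IPL this illustrates is the feedback between the two steps: each round's pseudo-labels shape the next round's embeddings, which in turn yield better pseudo-labels. The paper's contribution is showing the round-0 embeddings can come from a simple i-vector model rather than a strong self-supervised network.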