{"title":"鲁棒说话人验证领域自适应的双模型自正则化与融合","authors":"Yibo Duan , Yanhua Long , Jiaen Liang","doi":"10.1016/j.specom.2023.103001","DOIUrl":null,"url":null,"abstract":"<div><p>Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, where the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations; Then, we enhance the dual-model by combining self-supervised loss and supervised loss in a time-dependent manner, thus enhancing the model’s overall generalization capabilities. Furthermore, to better utilize the complementary information in the dual-model’s outputs, we explore various methods for similarity computation and score fusion. Our experiments, conducted on the publicly available <span>VoxCeleb2</span> and <span>VoxMovies</span><span><span> datasets, have demonstrated that our proposed dual-model regularization and fusion methods outperformed the strong baseline by a relative 9.07%–11.6% </span>EER reduction across various in-domain and cross-domain evaluation sets. 
Importantly, our approach exhibits effectiveness in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.</span></p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103001"},"PeriodicalIF":2.4000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dual-model self-regularization and fusion for domain adaptation of robust speaker verification\",\"authors\":\"Yibo Duan , Yanhua Long , Jiaen Liang\",\"doi\":\"10.1016/j.specom.2023.103001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, where the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations; Then, we enhance the dual-model by combining self-supervised loss and supervised loss in a time-dependent manner, thus enhancing the model’s overall generalization capabilities. Furthermore, to better utilize the complementary information in the dual-model’s outputs, we explore various methods for similarity computation and score fusion. 
Our experiments, conducted on the publicly available <span>VoxCeleb2</span> and <span>VoxMovies</span><span><span> datasets, have demonstrated that our proposed dual-model regularization and fusion methods outperformed the strong baseline by a relative 9.07%–11.6% </span>EER reduction across various in-domain and cross-domain evaluation sets. Importantly, our approach exhibits effectiveness in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.</span></p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"155 \",\"pages\":\"Article 103001\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001358\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001358","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Dual-model self-regularization and fusion for domain adaptation of robust speaker verification
Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, where the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations; Then, we enhance the dual-model by combining self-supervised loss and supervised loss in a time-dependent manner, thus enhancing the model’s overall generalization capabilities. Furthermore, to better utilize the complementary information in the dual-model’s outputs, we explore various methods for similarity computation and score fusion. Our experiments, conducted on the publicly available VoxCeleb2 and VoxMovies datasets, have demonstrated that our proposed dual-model regularization and fusion methods outperformed the strong baseline by a relative 9.07%–11.6% EER reduction across various in-domain and cross-domain evaluation sets. Importantly, our approach exhibits effectiveness in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.
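The abstract describes two concrete mechanisms: a time-dependent blend of self-supervised and supervised losses, and score-level fusion of the two branches' similarity scores. The sketch below illustrates both ideas in minimal Python. The linear ramp schedule and the equal fusion weight are hypothetical stand-ins; the paper's exact weighting function and tuned fusion weights are not given in the abstract.

```python
import math

def combined_loss(ssl_loss: float, sup_loss: float,
                  step: int, total_steps: int) -> float:
    """Blend self-supervised and supervised losses with a time-dependent
    weight. The linear ramp (supervised term grows from 0 to 1 over
    training) is an illustrative assumption, not the paper's schedule."""
    alpha = min(1.0, step / total_steps)  # 0 -> 1 as training progresses
    return (1.0 - alpha) * ssl_loss + alpha * sup_loss

def fuse_scores(test_a, test_b, enroll_a, enroll_b, w: float = 0.5) -> float:
    """Cosine-similarity scoring for each branch of a dual model, followed
    by weighted score-level fusion. Equal weighting (w=0.5) is an
    assumption; the paper explores several fusion strategies."""
    def cosine(x, y):
        dot = sum(xi * yi for xi, yi in zip(x, y))
        nx = math.sqrt(sum(xi * xi for xi in x))
        ny = math.sqrt(sum(yi * yi for yi in y))
        return dot / (nx * ny)
    # One similarity score per branch, fused at the score level.
    return w * cosine(test_a, enroll_a) + (1.0 - w) * cosine(test_b, enroll_b)
```

For example, early in training (`step=0`) the combined loss is purely self-supervised, and at the end (`step=total_steps`) it is purely supervised; intermediate steps interpolate linearly under this assumed schedule.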
Journal overview:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.