Hybrid Residual Block Time-Delay Neural Network Embeddings for Speaker Recognition
Zhor Benhafid, S. Selouani, M. S. Yakoub, A. Amrouche
2021 4th International Conference on Bio-Engineering for Smart Technologies (BioSMART), published 2021-12-08
DOI: 10.1109/BioSMART54244.2021.9677886 (https://doi.org/10.1109/BioSMART54244.2021.9677886)
Citations: 1
Abstract
Current speaker recognition systems are based either on time-delay neural network (TDNN) x-vectors or on ResNet speaker embeddings. Both architectures have their advantages, and this paper aims to benefit from their prominent and complementary features. In contrast to what has already been proposed in the literature, we investigate the impact of using only one residual neural network block, named ResBlock, on x-vectors instead of the several blocks used in conventional systems. Four ResBlock variants are integrated at the TDNN frame-level layer of x-vectors. The resulting hybrid One-ResBlock-TDNN architectures are evaluated on the Speakers in the Wild (SITW) and Voices Obscured in Complex Environmental Settings (VOiCES) evaluation sets. The experimental assessment reveals that, compared to the conventional x-vector encoder, all proposed hybrid One-ResBlock-TDNN variants achieve a noticeable accuracy improvement on both the SITW and VOiCES standard datasets.
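To make the idea of a single residual block at the TDNN frame level more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the four ResBlock variants, so the block below is a generic identity-shortcut design (y = x + F(x)) wrapped around two TDNN layers. The function names `tdnn_layer`, `res_block`, and `init_layer`, the context offsets, the zero padding at utterance edges, and all dimensions are assumptions chosen only for illustration.

```python
import random


def tdnn_layer(frames, weights, bias, context):
    """Apply one TDNN layer (a 1-D convolution over frame offsets) with ReLU.

    frames:  list of T feature vectors, each of length d_in
    weights: weights[k][o][i] for context index k, output unit o, input unit i
    bias:    list of d_out biases
    context: list of frame offsets, e.g. [-1, 0, 1]
    Frames outside the utterance are treated as zeros (an assumption),
    so the output keeps the same number of frames T.
    """
    num_frames = len(frames)
    d_in = len(frames[0])
    d_out = len(bias)
    out = []
    for t in range(num_frames):
        vec = list(bias)
        for k, offset in enumerate(context):
            s = t + offset
            if 0 <= s < num_frames:
                for o in range(d_out):
                    vec[o] += sum(weights[k][o][i] * frames[s][i] for i in range(d_in))
        out.append([max(0.0, v) for v in vec])  # ReLU
    return out


def res_block(frames, layer1, layer2, context):
    """Generic residual block (hypothetical variant): y = x + F(x),
    where F is two stacked TDNN layers with matching input/output size."""
    hidden = tdnn_layer(frames, layer1[0], layer1[1], context)
    hidden = tdnn_layer(hidden, layer2[0], layer2[1], context)
    return [[x + f for x, f in zip(x_row, f_row)]
            for x_row, f_row in zip(frames, hidden)]


def init_layer(d_in, d_out, n_ctx, rng, scale=0.1):
    """Small random weights and zero biases; purely illustrative."""
    weights = [[[rng.uniform(-scale, scale) for _ in range(d_in)]
                for _ in range(d_out)]
               for _ in range(n_ctx)]
    return weights, [0.0] * d_out


rng = random.Random(0)
context = [-1, 0, 1]                      # assumed frame offsets
x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]  # 6 frames, 4 dims
layer1 = init_layer(4, 4, len(context), rng)
layer2 = init_layer(4, 4, len(context), rng)
y = res_block(x, layer1, layer2, context)
print(len(y), len(y[0]))  # 6 4
```

The shortcut requires the block's input and output to have the same frame count and feature dimension, which is why the sketch pads the edges with zeros and keeps `d_in == d_out`. In an x-vector style system, the output frames would then continue through the remaining frame-level layers and the pooling stage; those parts are omitted here.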