Zhor Benhafid, S. Selouani, M. S. Yakoub, A. Amrouche
2021 4th International Conference on Bio-Engineering for Smart Technologies (BioSMART), published 2021-12-08. DOI: 10.1109/BioSMART54244.2021.9677886 (https://doi.org/10.1109/BioSMART54244.2021.9677886)
Hybrid Residual Block Time-Delay Neural Network Embeddings for Speaker Recognition
Current speaker recognition systems are based either on time-delay neural network (TDNN) x-vectors or on ResNet speaker embeddings. Both architectures have their advantages, and this paper aims to benefit from their prominent and complementary features. In contrast to what has already been proposed in the literature, we investigate the impact of using only one residual neural network block, named ResBlock, on x-vectors instead of the several blocks used in conventional systems. Four ResBlock variants are integrated at the TDNN frame-level layer of x-vectors. The resulting hybrid One-ResBlock-TDNN architectures are evaluated on the Speakers in the Wild (SITW) and Voices Obscured in Complex Environmental Settings (VOiCES) evaluation sets. The experimental assessment reveals that, compared with the conventional x-vector encoder, all proposed hybrid One-ResBlock-TDNN variants achieve a noticeable accuracy improvement on both the SITW and VOiCES standard datasets.
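To make the idea of a single residual block at the TDNN frame level more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation, and it does not reproduce any of the four ResBlock variants, which the abstract does not specify. The function names (`tdnn_layer`, `res_block`), the context offsets, the layer sizes, and the edge padding are all illustrative assumptions. It only shows the general pattern: a TDNN layer splices neighbouring frames and applies an affine transform, and a residual block adds the block's input back to its output.

```python
import numpy as np


def tdnn_layer(x, weight, bias, context):
    """Hypothetical frame-level TDNN layer (illustrative, not from the paper).

    Splices frames at the given context offsets, then applies an affine
    transform followed by ReLU.

    x:      (T, D_in) sequence of frame features
    weight: (len(context) * D_in, D_out)
    bias:   (D_out,)
    Frames are edge-padded (an assumption) so the output keeps T frames.
    """
    T = x.shape[0]
    left, right = max(0, -min(context)), max(0, max(context))
    padded = np.pad(x, ((left, right), (0, 0)), mode="edge")
    # One shifted copy of the sequence per context offset, concatenated per frame.
    spliced = np.concatenate(
        [padded[left + c : left + c + T] for c in context], axis=1
    )
    return np.maximum(spliced @ weight + bias, 0.0)


def res_block(x, params, context=(-1, 0, 1)):
    """One generic residual block: two TDNN layers plus an identity skip."""
    h = tdnn_layer(x, params["w1"], params["b1"], context)
    h = tdnn_layer(h, params["w2"], params["b2"], context)
    return x + h  # skip connection; requires D_out == D_in


# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
T, D = 50, 8
ctx = (-1, 0, 1)
params = {
    "w1": rng.normal(scale=0.1, size=(len(ctx) * D, D)),
    "b1": np.zeros(D),
    "w2": rng.normal(scale=0.1, size=(len(ctx) * D, D)),
    "b2": np.zeros(D),
}
frames = rng.normal(size=(T, D))
out = res_block(frames, params, ctx)
print(out.shape)  # (50, 8)
```

Because of the identity skip, the block reduces to a pass-through when its weights are zero, which is one common motivation for residual connections: the block only has to learn a correction to its input. In an x-vector style system, the frame-level output would then go to a statistics pooling layer, which is outside the scope of this sketch.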