{"title":"渐进式信道融合可提高 TDNN 在扬声器验证方面的效率","authors":"Zhenduo Zhao , Zhuo Li , Wenchao Wang , Ji Xu","doi":"10.1016/j.specom.2024.103105","DOIUrl":null,"url":null,"abstract":"<div><p>ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Note that Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103105"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Progressive channel fusion for more efficient TDNN on speaker verification\",\"authors\":\"Zhenduo Zhao , Zhuo Li , Wenchao Wang , Ji Xu\",\"doi\":\"10.1016/j.specom.2024.103105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Note that Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"163 \",\"pages\":\"Article 103105\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000773\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000773","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Progressive channel fusion for more efficient TDNN on speaker verification
ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Note that Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.