Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
arXiv - EE - Audio and Speech Processing, 2024-09-12. DOI: arxiv-2409.07770 (https://doi.org/arxiv-2409.07770)
Citations: 0
Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification
Recent advancements in automatic speaker verification (ASV) studies have been
achieved by leveraging large-scale pretrained networks. In this study, we
analyze the approaches toward such a paradigm and underline the significance of
interlayer information processing as a result. Accordingly, we present a novel
approach for exploiting the multilayered nature of pretrained models for ASV,
which comprises a layer/frame-level network and two steps of pooling
architectures for each layer and frame axis. Specifically, we let a convolutional
architecture directly process a stack of layer outputs. Then, we present a
channel attention-based scheme for gauging layer significance and squeeze the
layer level to its most representative value. Finally, attentive statistics
over frame-level representations yield a single-vector speaker embedding.
Comparative experiments are designed using versatile data environments and
diverse pretraining models to validate the proposed approach. The experimental
results demonstrate the stability of the approach using multi-layer outputs in
leveraging pretrained architectures. Then, we verify the superiority of the
proposed ASV backend structure, which involves layer-wise operations, in terms
of performance improvement along with cost efficiency compared to the
conventional method. The ablation study shows how the proposed interlayer
processing aids in maximizing the advantage of utilizing pretrained models.
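
The two pooling steps named in the abstract (squeezing the layer axis, then attentive statistics over frames) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything concrete here is an assumption: the function names `layer_pool` and `attentive_stats_pool`, the sigmoid gate computed from each layer's mean activation, the use of a maximum as the "most representative value", and the simplified attention score `v . tanh(h_t)` with a hypothetical weight vector `v`. The convolutional layer/frame-level network is omitted entirely.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_pool(feats):
    """Squeeze the layer axis: feats[layer][frame][dim] -> pooled[frame][dim].

    Assumption: each layer gets a scalar gate from its mean activation
    (a squeeze-and-excitation style stand-in for channel attention), and
    the layer axis is reduced with a maximum over the gated values.
    """
    n_layers = len(feats)
    n_frames = len(feats[0])
    n_dims = len(feats[0][0])

    # One gate per layer, each in (0, 1).
    gates = []
    for layer in feats:
        mean = sum(sum(frame) for frame in layer) / (n_frames * n_dims)
        gates.append(sigmoid(mean))

    # Scale every layer by its gate, then keep the largest value per (frame, dim).
    pooled = [
        [max(gates[l] * feats[l][t][d] for l in range(n_layers))
         for d in range(n_dims)]
        for t in range(n_frames)
    ]
    return pooled, gates


def attentive_stats_pool(frames, v):
    """Attentive statistics pooling: frames[frame][dim] -> vector of length 2 * dim.

    Assumption: the attention score is v . tanh(h_t); practical systems
    usually learn a small network for this score instead.
    """
    scores = [sum(vi * math.tanh(x) for vi, x in zip(v, frame)) for frame in frames]

    # Numerically stable softmax over frames.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    n_dims = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(n_dims)]
    var = [sum(a * f[d] ** 2 for a, f in zip(alphas, frames)) - mean[d] ** 2
           for d in range(n_dims)]
    std = [math.sqrt(max(x, 1e-12)) for x in var]

    # Concatenate weighted mean and weighted standard deviation.
    return mean + std


# Toy input: 2 layers, 3 frames, 2 feature dimensions.
toy_feats = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]
toy_pooled, toy_gates = layer_pool(toy_feats)
toy_embedding = attentive_stats_pool(toy_pooled, [0.1, -0.2])
```

The point of the sketch is the shape flow: a layers x frames x dims stack becomes frames x dims after the layer-level step, and a single vector of length 2 x dims after the frame-level step.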