Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-18. DOI: arxiv-2409.11909

Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li

Citations: 0

Abstract

Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods use pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained model, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from the layer features, guided by a gating network based on the last-layer feature, while keeping the pretrained model frozen. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to methods that require fine-tuning.
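The fusion idea described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the abstract does not specify the expert architecture, the number of experts, or how features are pooled, so the code assumes one linear-plus-ReLU expert per layer, time-pooled feature vectors, and a single linear gate followed by a softmax. The names `moe_fuse`, `expert_weights`, and `gate_weights` are hypothetical, and random numbers stand in for the outputs of the frozen wav2vec 2.0 layers.

```python
import math
import random


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector x of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def moe_fuse(layer_feats, expert_weights, gate_weights):
    """Fuse per-layer features using gate weights computed from the last layer.

    layer_feats: list of L vectors, one per frozen encoder layer (assumed pooled).
    expert_weights: list of L matrices, one hypothetical expert per layer.
    gate_weights: matrix mapping the last-layer vector to L gate scores.
    """
    # The gate only sees the last-layer feature, as the abstract describes.
    gate = softmax(linear(gate_weights, layer_feats[-1]))
    # Assumption: each expert is a linear map followed by ReLU.
    expert_outs = [
        [max(0.0, v) for v in linear(w, h)]
        for w, h in zip(expert_weights, layer_feats)
    ]
    out_dim = len(expert_outs[0])
    # Weighted sum of expert outputs, one gate weight per expert.
    fused = [
        sum(g * out[i] for g, out in zip(gate, expert_outs))
        for i in range(out_dim)
    ]
    return fused, gate


rng = random.Random(0)
num_layers, feat_dim, out_dim = 4, 8, 3  # toy sizes, not the paper's settings
# Stand-ins for frozen wav2vec 2.0 layer outputs; no encoder weights are updated.
layer_feats = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_layers)]
expert_weights = [random_matrix(out_dim, feat_dim, rng) for _ in range(num_layers)]
gate_weights = random_matrix(num_layers, feat_dim, rng)

fused, gate = moe_fuse(layer_feats, expert_weights, gate_weights)
print(len(fused), round(sum(gate), 6))
```

In this sketch only the expert and gate matrices would be trained. The layer features are treated as fixed inputs, which is the property that lets the pretrained model stay frozen and keeps training short.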


