Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

arXiv - CS - Sound, Pub Date: 2024-05-03, DOI: arxiv-2405.02179

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

Citations: 0

Abstract


Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.
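
The verification-style decision described in the abstract can be illustrated with a minimal sketch. The abstract does not name the pre-trained model, the similarity measure, the way scores over the reference set are combined, or any threshold, so everything specific here is an assumption made for illustration: embeddings are taken to be plain lists of floats already produced by some general-purpose pre-trained audio model, similarity is cosine similarity, the score is the best match over the reference fragments, and the function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_match_score(test_embedding: Vector,
                         reference_embeddings: Sequence[Vector]) -> float:
    """Best similarity between the test sample and the claimed identity's
    reference fragments. Taking the maximum is an assumption; a mean or
    another aggregation rule would fit the same framework."""
    if not reference_embeddings:
        raise ValueError("at least one reference fragment is required")
    return max(cosine_similarity(test_embedding, ref)
               for ref in reference_embeddings)


def is_suspected_fake(test_embedding: Vector,
                      reference_embeddings: Sequence[Vector],
                      threshold: float = 0.5) -> bool:
    """Flag the sample when it does not match the claimed identity well.
    The threshold value is a placeholder, not taken from the paper."""
    return identity_match_score(test_embedding, reference_embeddings) < threshold


# Toy 3-dimensional "embeddings" standing in for real model outputs.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
close_to_identity = [0.95, 0.05, 0.0]
far_from_identity = [0.0, 1.0, 0.0]

print(is_suspected_fake(close_to_identity, references))
print(is_suspected_fake(far_from_identity, references))
```

Note that no fake speech appears anywhere in this decision rule: only genuine reference fragments of the claimed identity are needed, which is the property the abstract credits for generalization across generation methods.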
