Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
{"title":"利用大规模预训练模型实现免训练深度伪语音识别","authors":"Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva","doi":"arxiv-2405.02179","DOIUrl":null,"url":null,"abstract":"Generalization is a main issue for current audio deepfake detectors, which\nstruggle to provide reliable results on out-of-distribution data. Given the\nspeed at which more and more accurate synthesis methods are developed, it is\nvery important to design techniques that work well also on data they were not\ntrained for. In this paper we study the potential of large-scale pre-trained\nmodels for audio deepfake detection, with special focus on generalization\nability. To this end, the detection problem is reformulated in a speaker\nverification framework and fake audios are exposed by the mismatch between the\nvoice sample under test and the voice of the claimed identity. With this\nparadigm, no fake speech sample is necessary in training, cutting off any link\nwith the generation method at the root, and ensuring full generalization\nability. Features are extracted by general-purpose large pre-trained models,\nwith no need for training or fine-tuning on specific fake detection or speaker\nverification datasets. At detection time only a limited set of voice fragments\nof the identity under test is required. 
Experiments on several datasets\nwidespread in the community show that detectors based on pre-trained models\nachieve excellent performance and show strong generalization ability, rivaling\nsupervised methods on in-distribution data and largely overcoming them on\nout-of-distribution data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"80 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models\",\"authors\":\"Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva\",\"doi\":\"arxiv-2405.02179\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Generalization is a main issue for current audio deepfake detectors, which\\nstruggle to provide reliable results on out-of-distribution data. Given the\\nspeed at which more and more accurate synthesis methods are developed, it is\\nvery important to design techniques that work well also on data they were not\\ntrained for. In this paper we study the potential of large-scale pre-trained\\nmodels for audio deepfake detection, with special focus on generalization\\nability. To this end, the detection problem is reformulated in a speaker\\nverification framework and fake audios are exposed by the mismatch between the\\nvoice sample under test and the voice of the claimed identity. With this\\nparadigm, no fake speech sample is necessary in training, cutting off any link\\nwith the generation method at the root, and ensuring full generalization\\nability. Features are extracted by general-purpose large pre-trained models,\\nwith no need for training or fine-tuning on specific fake detection or speaker\\nverification datasets. At detection time only a limited set of voice fragments\\nof the identity under test is required. 
Experiments on several datasets\\nwidespread in the community show that detectors based on pre-trained models\\nachieve excellent performance and show strong generalization ability, rivaling\\nsupervised methods on in-distribution data and largely overcoming them on\\nout-of-distribution data.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":\"80 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.02179\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.02179","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Generalization is a major issue for current audio deepfake detectors, which
struggle to provide reliable results on out-of-distribution data. Given the
speed at which ever more accurate synthesis methods are developed, it is
crucial to design techniques that also work well on data they were not
trained on. In this paper we study the potential of large-scale pre-trained
models for audio deepfake detection, with a special focus on generalization
ability. To this end, the detection problem is reformulated in a speaker
verification framework, and fake audio is exposed by the mismatch between the
voice sample under test and the voice of the claimed identity. With this
paradigm, no fake speech samples are necessary for training, cutting off any
link with the generation method at the root and ensuring full generalization
ability. Features are extracted by general-purpose large pre-trained models,
with no need for training or fine-tuning on specific fake-detection or
speaker-verification datasets. At detection time, only a limited set of voice
fragments of the identity under test is required. Experiments on several
datasets widespread in the community show that detectors based on pre-trained
models achieve excellent performance and strong generalization ability,
rivaling supervised methods on in-distribution data and largely outperforming
them on out-of-distribution data.
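The verification-style decision rule described in the abstract can be sketched in a few lines. The sketch below is illustrative only and makes assumptions not stated in the paper: embeddings are plain vectors already extracted by some pre-trained model (the model itself is not shown), the mismatch score is the maximum cosine similarity against the claimed identity's reference fragments, and the threshold value is arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_speaker(test_embedding, reference_embeddings, threshold=0.7):
    """Speaker-verification-style deepfake check (illustrative sketch).

    test_embedding: embedding of the audio sample under test.
    reference_embeddings: embeddings of a few genuine voice fragments
        of the claimed identity (the only data needed at detection time).
    Returns (is_genuine, score): the sample is accepted as genuine when its
    best similarity to any reference fragment reaches the threshold;
    a mismatch with the claimed voice exposes the sample as fake.
    """
    score = max(cosine_similarity(test_embedding, ref)
                for ref in reference_embeddings)
    return score >= threshold, score

# Toy example with 3-dimensional stand-in embeddings.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
genuine, g_score = verify_speaker([1.0, 0.0, 0.0], references)
fake, f_score = verify_speaker([0.0, 1.0, 0.0], references)
```

Note that no fake sample ever enters this pipeline: the decision depends only on genuine references of the claimed identity, which is what severs the link to any particular generation method.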