Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
arXiv - CS - Sound, published 2024-05-03. DOI: arxiv-2405.02179 (https://doi.org/arxiv-2405.02179)
Citations: 0

Abstract

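Generalization is a central problem for current audio deepfake detectors, which struggle to give reliable results on out-of-distribution data. Because increasingly accurate synthesis methods appear at a rapid pace, it is very important to design techniques that also work well on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with a special focus on generalization ability. To this end, we reformulate detection as a speaker verification problem: fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. Under this paradigm, no fake speech sample is needed in training, which cuts off any link with the generation method at the root and ensures full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake-detection or speaker-verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widely used in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.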
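The verification-style decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, how scores over the reference set are aggregated, or the decision threshold, so the cosine similarity, the max aggregation, and the value 0.5 below are all assumptions. The embeddings are toy 3-D vectors standing in for the output of a pre-trained audio model, and the names `identity_score` and `is_fake` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit Euclidean length."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def identity_score(test_embedding, reference_embeddings):
    """Highest cosine similarity between the test clip and any reference clip.

    Assumption: cosine similarity with max aggregation; the abstract only
    says that a mismatch with the claimed identity exposes the fake.
    """
    test = l2_normalize(test_embedding)        # shape (d,)
    refs = l2_normalize(reference_embeddings)  # shape (n_refs, d)
    return float(np.max(refs @ test))


def is_fake(test_embedding, reference_embeddings, threshold=0.5):
    """Flag the clip as fake when it does not match the claimed identity.

    Assumption: the threshold 0.5 is illustrative, not taken from the paper.
    """
    return identity_score(test_embedding, reference_embeddings) < threshold


# Toy vectors standing in for embeddings from a pre-trained model.
reference_set = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
matching_clip = np.array([1.0, 0.05, 0.0])
mismatched_clip = np.array([0.0, 1.0, 0.0])

print(is_fake(matching_clip, reference_set))    # False
print(is_fake(mismatched_clip, reference_set))  # True
```

Nothing in this sketch is fitted to fake speech: the only inputs are the reference clips of the claimed identity and the clip under test, which is the property the abstract relies on for generalization. In practice the threshold would have to be chosen on real speech only.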