SafeEar: Content Privacy-Preserving Audio Deepfake Detection
Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
arXiv - CS - Multimedia, 2024-09-14, arXiv:2409.09272
Abstract
Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited
remarkable performance in generating realistic and natural audio. However,
their dark side, audio deepfakes, poses a significant threat to both society
and individuals. Existing countermeasures largely focus on determining the
genuineness of speech based on complete original audio recordings, which
often contain private content. This oversight may keep deepfake detection
out of many applications, particularly in scenarios involving sensitive
information like business secrets. In this paper, we propose SafeEar, a novel
framework that detects deepfake audio without relying on access to the
speech content within. Our key idea is to adapt a neural audio codec into a
novel decoupling model that cleanly separates the semantic and acoustic
information in audio samples, and to use only the acoustic information (e.g.,
prosody and timbre) for deepfake detection. In this way, no semantic content
will be exposed to the detector. To overcome the challenge of identifying
diverse deepfake audio without semantic clues, we enhance our deepfake detector
with real-world codec augmentation. Extensive experiments conducted on four
benchmark datasets demonstrate SafeEar's effectiveness in detecting various
deepfake techniques with an equal error rate (EER) down to 2.02%.
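As a reminder of what the reported metric measures: the equal error rate (EER) is the point at which the detector's false acceptance rate equals its false rejection rate. The sketch below is purely illustrative (not the paper's code); the `eer` function and its example scores are made up for demonstration.

```python
# Illustrative EER computation: find the score threshold where the false
# acceptance rate (FAR) and false rejection rate (FRR) are closest, and
# report their average there. Higher scores mean "more likely genuine".

def eer(genuine_scores, fake_scores):
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # FRR: genuine audio rejected because it scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # FAR: fake audio accepted because it scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Perfectly separated scores give an EER of 0.0; overlapping scores do not.
print(eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))        # 0.0
print(eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.2]))  # 0.25
```

A lower EER means better separation of genuine and fake audio, so 2.02% indicates the detector rarely confuses the two even without semantic clues.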
Simultaneously, it shields speech content in five languages from being
deciphered by either machine recognition or human auditory analysis, as
demonstrated by word error rates (WERs) all above 93.93% and by our user
study. Furthermore, our benchmark
constructed for anti-deepfake and anti-content recovery evaluation helps
provide a basis for future research in the realms of audio privacy preservation
and deepfake detection.
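The WER figures above quantify how unintelligible the protected content is to a recognizer: a WER near or above 100% means almost every reference word is lost. The following is a minimal illustrative sketch of the standard WER computation (not the paper's evaluation code); the `wer` function and example strings are hypothetical.

```python
# Illustrative WER: word-level Levenshtein edit distance divided by the
# number of reference words. WER can exceed 100% when the hypothesis adds
# many spurious (inserted) words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "a dog sat"))    # two substitutions out of three words
```

Under this metric, WERs above 93.93% indicate that a recognizer recovers almost none of the original words from SafeEar's acoustic-only representation.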