SafeEar: Content Privacy-Preserving Audio Deepfake Detection
Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
arXiv - CS - Multimedia, 2024-09-14, arXiv:2409.09272
Abstract
Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited
remarkable performance in generating realistic and natural audio. However,
their dark side, audio deepfakes, poses a significant threat to both society
and individuals. Existing countermeasures largely focus on determining the
genuineness of speech based on complete original audio recordings, which
often contain private content. This oversight may keep deepfake detection
out of many applications, particularly in scenarios involving sensitive
information like business secrets. In this paper, we propose SafeEar, a novel
framework that detects deepfake audio without relying on access to the
speech content within. Our key idea is to adapt a neural audio codec into a
novel decoupling model that cleanly separates the semantic and acoustic
information in audio samples, and to use only the acoustic information (e.g.,
prosody and timbre) for deepfake detection. In this way, no semantic content
will be exposed to the detector. To overcome the challenge of identifying
diverse deepfake audio without semantic clues, we enhance our deepfake detector
with real-world codec augmentation. Extensive experiments conducted on four
benchmark datasets demonstrate SafeEar's effectiveness in detecting various
deepfake techniques with an equal error rate (EER) down to 2.02%.
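As a reminder of what the reported metric measures: the equal error rate (EER) is the point at which the detector's false acceptance rate equals its false rejection rate. The sketch below is purely illustrative (not the paper's code); the `eer` function and its example scores are made up for demonstration.

```python
# Illustrative EER computation: find the score threshold where the false
# acceptance rate (FAR) and false rejection rate (FRR) are closest, and
# report their average there. Higher scores mean "more likely genuine".

def eer(genuine_scores, fake_scores):
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # FRR: genuine audio rejected because it scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # FAR: fake audio accepted because it scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Perfectly separated scores give an EER of 0.0; overlapping scores do not.
print(eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))        # 0.0
print(eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.2]))  # 0.25
```

A lower EER means better separation of genuine and fake audio, so 2.02% indicates the detector rarely confuses the two even without semantic clues.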
Simultaneously, it shields speech content in five languages from being
deciphered by either machine recognition or human auditory analysis, as
demonstrated by word error rates (WERs) all above 93.93% and by our user
study. Furthermore, our benchmark
constructed for anti-deepfake and anti-content recovery evaluation helps
provide a basis for future research in the realms of audio privacy preservation
and deepfake detection.
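The WER figures above quantify how unintelligible the protected content is to a recognizer: a WER near or above 100% means almost every reference word is lost. The following is a minimal illustrative sketch of the standard WER computation (not the paper's evaluation code); the `wer` function and example strings are hypothetical.

```python
# Illustrative WER: word-level Levenshtein edit distance divided by the
# number of reference words. WER can exceed 100% when the hypothesis adds
# many spurious (inserted) words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "a dog sat"))    # two substitutions out of three words
```

Under this metric, WERs above 93.93% indicate that a recognizer recovers almost none of the original words from SafeEar's acoustic-only representation.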