"Glitch in the Matrix!": A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Comput. Vis. Image Underst. Pub Date : 2023-05-03 DOI:10.48550/arXiv.2305.01979

Zhixi Cai, Shreya Ghosh, Tom Gedeon, Abhinav Dhall, Kalin Stefanov, Munawar Hayat


Citations: 1

Abstract

Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered on the binary classification task of deciding whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications that span the entire video. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulation that completely change the meaning of the video content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve the baseline method (the improved version is BA-TFD+) by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets, including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.
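The abstract contrasts video-level binary detection with temporal localization of short forged segments. The sketch below illustrates that difference only. It is not BA-TFD or BA-TFD+, and it assumes a hypothetical list of per-frame fake scores in [0, 1] from some unspecified model. The function names, the 0.5 threshold and the frame rate are illustrative assumptions, and temporal IoU is shown as one common way to compare a predicted segment with a ground-truth one, not as the paper's stated metric.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def scores_to_segments(scores: List[float], fps: float,
                       threshold: float = 0.5) -> List[Segment]:
    """Group consecutive frames whose score exceeds `threshold` into segments.

    `scores` is a hypothetical per-frame fake probability; `fps` converts
    frame indices to seconds. The end time is exclusive (first real frame).
    """
    segments: List[Segment] = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a forged run begins at this frame
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))  # the run ended before frame i
            start = None
    if start is not None:  # the run extends to the end of the video
        segments.append((start / fps, len(scores) / fps))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Toy example: 10 frames at 5 fps, with frames 2 to 4 flagged as fake.
frame_scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
predicted = scores_to_segments(frame_scores, fps=5.0)

# Binary detection collapses everything to a single video-level label ...
video_is_fake = len(predicted) > 0
# ... whereas localization also reports where the manipulation sits.
ground_truth = (0.5, 1.0)  # assumed annotation, for illustration only
overlap = temporal_iou(predicted[0], ground_truth)
```

In this toy case the video-level label only says that the clip is fake, while the segment list also says which fraction of a second was altered, which is the information a content-driven forgery benchmark needs.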


