Tex-Net: texture-based parallel branch cross-attention generalized robust Deepfake detector

ACS Applied Electronic Materials · IF 4.3 · Q1 (Engineering, Electrical & Electronic) · CAS Zone 3 (Materials Science) · Pub Date: 2024-08-01 · DOI: 10.1007/s00530-024-01424-7
Deepak Dagar, Dinesh Kumar Vishwakarma

Abstract

In recent years, artificial faces generated using Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) have become more lifelike and difficult for humans to distinguish. Deepfake refers to highly realistic media generated using deep learning technology. Convolutional Neural Networks (CNNs) have demonstrated significant potential in computer vision applications, particularly in identifying fraudulent faces. However, when trained on insufficient data, these networks cannot effectively transfer their knowledge to unfamiliar datasets, as they are constrained by the inductive biases of their learning process, such as translation equivariance and locality. The attention mechanism of vision transformers has effectively addressed these limitations, leading to their growing popularity in recent years. This work introduces a novel module for extracting global texture information and a model that combines features from a CNN (ResNet-18) with a cross-attention vision transformer. The model generates global texture from the input by applying Gram matrices and local binary patterns at each downsampling step of the ResNet-18 architecture. The ResNet-18 main branch and the global texture module operate in parallel before feeding into the cross-attention mechanism of the vision transformer's dual branches. The empirical investigation first demonstrates that counterfeit images typically display more uniform textures that are inconsistent over long ranges. The model's cross-forgery performance is demonstrated by experiments on various types of GAN-generated images and FaceForensics++ categories. The results show that the model outperforms many state-of-the-art techniques, achieving an accuracy of up to 85%. Furthermore, multiple tests are performed on data samples (FF++, DFDCPreview, Celeb-DF) subjected to post-processing techniques, including compression, noise addition, and blurring. These studies validate that the model acquires shared distinguishing characteristics (global texture) that persist across different fake-image distributions, and the outcomes demonstrate that the model is robust and applicable in many scenarios.
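The abstract states that global texture is built from Gram matrices and local binary patterns computed at each ResNet-18 downsampling step. The exact module design is not given here, but the two underlying operations can be sketched as follows (a minimal NumPy illustration; the function names and normalization choices are assumptions, not taken from the paper):

```python
import numpy as np

def gram_matrix(features):
    """Channel-wise Gram matrix of a CNN feature map.

    features: array of shape (C, H, W). Returns a (C, C) matrix of inner
    products between flattened channel activations, normalized by the
    number of spatial positions. Off-diagonal entries capture channel
    co-activation statistics, a classic second-order texture descriptor.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def lbp_8neighbour(img):
    """Basic 8-neighbour local binary pattern for a 2-D grayscale image.

    Each pixel is encoded as an 8-bit code: one bit per neighbour, set
    when that neighbour is >= the centre pixel. Borders use edge padding.
    """
    padded = np.pad(img, 1, mode="edge")
    codes = np.zeros_like(img, dtype=np.uint8)
    # Enumerate the 8 neighbours clockwise, starting from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = padded[1 + dy:1 + dy + img.shape[0],
                       1 + dx:1 + dx + img.shape[1]]
        codes |= (neigh >= img).astype(np.uint8) << bit
    return codes
```

In the paper's setting these descriptors would be computed on intermediate ResNet-18 feature maps rather than raw pixels; how the per-stage outputs are aggregated into the final global-texture token is specific to the proposed module.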
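The fusion step routes the two parallel branches into a dual-branch cross-attention mechanism. A single-head scaled dot-product version of cross-attention, where one branch supplies queries and the other supplies keys and values, can be sketched like this (an illustrative simplification; the paper's transformer is multi-head with learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, keys_values_from):
    """Single-head scaled dot-product cross-attention.

    Tokens of one branch act as queries; tokens of the other branch
    supply keys and values, so each branch attends to the other's
    representation. Shapes: (n_q, d) and (n_kv, d) -> (n_q, d).
    """
    d = queries_from.shape[-1]
    scores = queries_from @ keys_values_from.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)    # (n_q, n_kv), rows sum to 1
    return weights @ keys_values_from     # convex mixtures of the other branch
```

Running it in both directions (texture tokens as queries over CNN tokens, and vice versa) is what lets texture cues condition the backbone features and conversely.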
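The robustness experiments apply compression, noise addition, and blurring to the test samples. Two of these perturbations are easy to reproduce for one's own evaluation; the sketch below uses NumPy only (parameter values are illustrative, and JPEG compression would additionally require an image library):

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Additive Gaussian noise, clipped back to the valid [0, 255] range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def box_blur(img, k=3):
    """Naive k x k mean filter; borders are handled with edge padding."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)
```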

