Video Forgery Detection Using Spatio-Temporal Dual Transformer

Chenyu Liu, Jia Li, Junxian Duan, Huaibo Huang
{"title":"基于时空双变压器的视频伪造检测","authors":"Chenyu Liu, Jia Li, Junxian Duan, Huaibo Huang","doi":"10.1145/3581807.3581847","DOIUrl":null,"url":null,"abstract":"The fake videos generated by deep generation technology pose a potential threat to social stability, which makes it critical to detect fake videos. Although the previous detection methods have achieved high accuracy, the generalization to different datasets and in realistic scenes is not effective. We find several novel temporal and spatial clues. In the frequency domain, the inter-frame differences between the real and fake videos are significantly more obvious than the intra-frame differences. In the shallow texture on the CbCr color channels, the forged areas of the fake videos appear more distinct blurring compared to the real videos. And the optical flow of the real video changes gradually, while the optical flow of the fake video changes drastically. This paper proposes a spatio-temporal dual Transformer network for video forgery detection that integrates spatio-temporal clues with the temporal consistency of consecutive frames to improve generalization. Specifically, an EfficientNet is first used to extract spatial artifacts of shallow textures and high-frequency information. We add a new loss function to EfficientNet to extract more robust face features, as well as introduce an attention mechanism to enhance the extracted features. Next, a Swin Transformer is used to capture the subtle temporal artifacts in inter-frame spectrum difference and the optical flow. A feature interaction module is added to fuse local features and global representations. Finally, another Swin Transformer is used to classify the videos according to the extracted spatio-temporal features. We evaluate our method on datasets such as FaceForensics++, Celeb-DF (v2) and DFDC. Extensive experiments show that the proposed framework has high accuracy and generalization, outperforming the current state-of-the-art methods.","PeriodicalId":292813,"journal":{"name":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Video Forgery Detection Using Spatio-Temporal Dual Transformer\",\"authors\":\"Chenyu Liu, Jia Li, Junxian Duan, Huaibo Huang\",\"doi\":\"10.1145/3581807.3581847\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The fake videos generated by deep generation technology pose a potential threat to social stability, which makes it critical to detect fake videos. Although the previous detection methods have achieved high accuracy, the generalization to different datasets and in realistic scenes is not effective. We find several novel temporal and spatial clues. In the frequency domain, the inter-frame differences between the real and fake videos are significantly more obvious than the intra-frame differences. In the shallow texture on the CbCr color channels, the forged areas of the fake videos appear more distinct blurring compared to the real videos. And the optical flow of the real video changes gradually, while the optical flow of the fake video changes drastically. This paper proposes a spatio-temporal dual Transformer network for video forgery detection that integrates spatio-temporal clues with the temporal consistency of consecutive frames to improve generalization. 
Specifically, an EfficientNet is first used to extract spatial artifacts of shallow textures and high-frequency information. We add a new loss function to EfficientNet to extract more robust face features, as well as introduce an attention mechanism to enhance the extracted features. Next, a Swin Transformer is used to capture the subtle temporal artifacts in inter-frame spectrum difference and the optical flow. A feature interaction module is added to fuse local features and global representations. Finally, another Swin Transformer is used to classify the videos according to the extracted spatio-temporal features. We evaluate our method on datasets such as FaceForensics++, Celeb-DF (v2) and DFDC. Extensive experiments show that the proposed framework has high accuracy and generalization, outperforming the current state-of-the-art methods.\",\"PeriodicalId\":292813,\"journal\":{\"name\":\"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3581807.3581847\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3581807.3581847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Fake videos generated by deep generative techniques pose a potential threat to social stability, which makes detecting them critical. Although previous detection methods achieve high accuracy, they generalize poorly to different datasets and to realistic scenes. We identify several novel temporal and spatial clues. In the frequency domain, the inter-frame differences between real and fake videos are significantly more pronounced than the intra-frame differences. In the shallow texture of the CbCr color channels, the forged regions of fake videos appear noticeably more blurred than in real videos. Moreover, the optical flow of real videos changes gradually, whereas that of fake videos changes drastically. This paper proposes a spatio-temporal dual Transformer network for video forgery detection that integrates these spatio-temporal clues with the temporal consistency of consecutive frames to improve generalization. Specifically, an EfficientNet first extracts spatial artifacts from the shallow textures and high-frequency information. We add a new loss function to EfficientNet to extract more robust face features and introduce an attention mechanism to enhance the extracted features. Next, a Swin Transformer captures the subtle temporal artifacts in the inter-frame spectrum difference and the optical flow, and a feature interaction module fuses local features with global representations. Finally, another Swin Transformer classifies the videos according to the extracted spatio-temporal features. We evaluate our method on the FaceForensics++, Celeb-DF (v2), and DFDC datasets. Extensive experiments show that the proposed framework achieves high accuracy and generalization, outperforming current state-of-the-art methods.
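As a rough illustration of the clues described in the abstract, the sketch below computes an inter-frame magnitude-spectrum difference, a shallow texture map on the Cb/Cr chroma channels, and dense optical flow for a pair of consecutive frames. It is a minimal sketch using only OpenCV and NumPy; the function names, and the use of a Laplacian filter as a stand-in for the paper's shallow-texture extraction, are assumptions made for illustration rather than the authors' implementation.

```python
# Illustrative sketch only: approximates the three spatio-temporal clues named in the
# abstract (inter-frame spectrum difference, CbCr shallow texture, optical flow).
# The Laplacian-based texture map is an assumption, not the paper's exact method.
import cv2
import numpy as np

def interframe_spectrum_difference(frame_a, frame_b):
    """Difference of log-magnitude spectra between two consecutive BGR frames."""
    spec_a = np.fft.fftshift(np.fft.fft2(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)))
    spec_b = np.fft.fftshift(np.fft.fft2(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)))
    return np.abs(np.log1p(np.abs(spec_a)) - np.log1p(np.abs(spec_b)))

def cbcr_shallow_texture(frame):
    """High-frequency (Laplacian) response on the Cb and Cr chroma channels."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)  # OpenCV channel order: Y, Cr, Cb
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return np.stack([cv2.Laplacian(cb, cv2.CV_32F),
                     cv2.Laplacian(cr, cv2.CV_32F)], axis=-1)

def dense_optical_flow(frame_a, frame_b):
    """Farneback dense optical flow between consecutive frames (H x W x 2)."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

In the paper's pipeline, maps of this kind would presumably feed the EfficientNet and Swin Transformer branches; here they merely make the clue definitions concrete.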