Deepfake videos pose a significant threat to digital media credibility and public trust. Although existing multimodal detection methods have advanced, they struggle to generalize across diverse real-world scenarios: most current approaches focus exclusively on either synchronization detection or content consistency checking, which limits their effectiveness. To tackle these challenges, this study introduces a new dual-branch architecture that simultaneously learns synchronization features and content consistency representations. The model comprises a synchronization branch that captures temporal misalignments and a content branch that detects semantic anomalies, with a decoupling loss that enhances task specificity. In the content branch, a conditional generation task reconstructs the fused feature sequence conditioned on the content token, strengthening the feature representations through self-supervised learning. The proposed method also includes a hierarchical cross-modal interaction mechanism built on cross-attention and fine-grained embeddings: cross-attention fuses features from different modalities to enrich the representations, while fine-grained embeddings supply the model with detailed local information. Experimental results show that our approach attains an AUC of 98.30% on the FakeAVCeleb dataset, approaching the current state of the art (SOTA). In cross-dataset evaluation, it outperforms SOTA approaches by 0.08%, 13.46%, and 10.12% on the DeepfakeTIMIT, LAV-DF, and MAVOS-DD datasets, respectively, with AUC scores of 99.11%, 86.97%, and 67.23%. Our code is available at https://github.com/zhudedede5-droid/AVSCNet.
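To make the dual-branch design concrete, the following PyTorch sketch illustrates one plausible arrangement of the components named above: cross-attention fusing audio and visual token sequences, a synchronization branch over the fused sequence, a content branch pooled through a learned content token, and a decoupling loss that discourages the two branches from learning redundant features. All module choices, dimensions, and the exact form of the decoupling loss are illustrative assumptions on my part, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchDetector(nn.Module):
    """Hypothetical sketch of a dual-branch audio-visual deepfake detector.

    Module names and dimensions are assumptions for illustration only.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-attention fuses the audio and visual token sequences.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Synchronization branch: a recurrent model over the fused sequence
        # to capture temporal (mis)alignment between modalities.
        self.sync_branch = nn.GRU(dim, dim, batch_first=True)
        # Content branch: a learned content token pools semantic evidence
        # from the fused sequence via a small Transformer encoder.
        self.content_token = nn.Parameter(torch.randn(1, 1, dim))
        self.content_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2,
        )
        self.sync_head = nn.Linear(dim, 2)
        self.content_head = nn.Linear(dim, 2)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (B, T, dim) per-frame embeddings.
        fused, _ = self.cross_attn(visual_feats, audio_feats, audio_feats)
        # Synchronization branch scores temporal misalignment.
        _, h = self.sync_branch(fused)
        sync_feat = h[-1]                                   # (B, dim)
        # Content branch reads out the prepended content token.
        tok = self.content_token.expand(fused.size(0), -1, -1)
        content_feat = self.content_branch(
            torch.cat([tok, fused], dim=1))[:, 0]           # (B, dim)
        return (self.sync_head(sync_feat), self.content_head(content_feat),
                sync_feat, content_feat)


def decoupling_loss(sync_feat, content_feat):
    # One plausible decoupling objective (an assumption): penalize the
    # correlation between branch features so each stays task-specific.
    s = F.normalize(sync_feat, dim=-1)
    c = F.normalize(content_feat, dim=-1)
    return (s * c).sum(dim=-1).abs().mean()
```

In this reading, the decoupling loss is minimized when the synchronization and content features are orthogonal, which is one common way to enforce the task specificity the abstract describes; the paper may realize it differently.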
