The proliferation of sophisticated video editing tools has made video forgery a serious security threat. While existing forensic methods achieve promising pixel-level tampering localization on conventional datasets, they often struggle with manipulations involving multiple small objects, a prevalent and challenging real-world scenario that has long been overlooked by major studies. To address this core challenge, we propose a novel universal Video Forensic Vision Network (VFVNet). VFVNet employs a dual-stream architecture: a Multi-View Feature Extractor (MVFE) captures diverse low-to-mid-level cues (e.g., texture, compression artifacts) from multiple perspectives, while a Contextual Semantic Feature Extractor (CSFE) models higher-level semantic context and spatial relationships to detect unnatural placements or variations in forensic traces. The fused features from both streams then enter a Swin Deep Attention Module (SDAM), which explores latent feature correlations across scales and locations; its deep attention refines the representation, amplifying tampering-relevant cues while suppressing background noise. Finally, the attention-guided features are combined with the initially extracted features to enable precise pixel-level discrimination between authentic and manipulated regions. Extensive experiments on three public and self-built datasets demonstrate that VFVNet achieves state-of-the-art performance in pixel-level localization. On average, it delivers relative improvements of 13.6% in F1-score and 14.9% in MCC over the prior best method, while maintaining a superior AUC.
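To make the described data flow concrete, the following is a minimal PyTorch sketch of the two-stream design: two parallel extractors, feature fusion, a windowed-attention refinement stage, and a pixel-level head that combines refined and initial features. The class names mirror the abstract's module names, but every internal detail (channel widths, the number of MVFE views, the dilated-convolution CSFE, the plain windowed attention standing in for a full Swin block, and the fusion rule) is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of the dual-stream design described in the abstract.
# All module internals are assumptions for illustration only.
import torch
import torch.nn as nn


class MVFE(nn.Module):
    """Multi-View Feature Extractor: parallel shallow conv branches, one per
    'view' of the frame, capturing low-to-mid-level cues such as texture and
    compression artifacts. (View count and depth are assumed.)"""
    def __init__(self, in_ch=3, dim=64, views=2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            )
            for _ in range(views)
        )
        self.merge = nn.Conv2d(dim * views, dim, 1)

    def forward(self, x):
        # In a full model each branch would see a different view of x
        # (e.g., a high-pass noise residual); here all branches see x.
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))


class CSFE(nn.Module):
    """Contextual Semantic Feature Extractor: dilated convolutions widen the
    receptive field to model semantic context and spatial relationships."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class SDAM(nn.Module):
    """Stand-in for the Swin Deep Attention Module: multi-head self-attention
    within non-overlapping windows of the fused feature map. A real Swin
    block would add shifted windows and an MLP sub-layer."""
    def __init__(self, dim=64, heads=4, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        b, c, h, w = x.shape          # assumes h and w divisible by window
        ws = self.window
        # Partition into ws x ws windows: (B, C, H, W) -> (B*nW, ws*ws, C).
        t = x.view(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        # Undo the window partition back to (B, C, H, W).
        t = t.view(b, h // ws, w // ws, ws, ws, c)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)


class VFVNetSketch(nn.Module):
    """End-to-end flow from the abstract: two streams -> fuse -> attention
    refinement -> combine with initial features -> per-pixel tampering mask."""
    def __init__(self, dim=64):
        super().__init__()
        self.mvfe, self.csfe = MVFE(dim=dim), CSFE(dim=dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        self.sdam = SDAM(dim=dim)
        self.head = nn.Conv2d(2 * dim, 1, 1)  # logits: tampered vs. authentic

    def forward(self, x):
        low, ctx = self.mvfe(x), self.csfe(x)
        fused = self.fuse(torch.cat([low, ctx], dim=1))
        refined = self.sdam(fused)
        # Attention-guided features are combined with the initially fused
        # features before the pixel-level decision, as the abstract describes.
        return self.head(torch.cat([fused, refined], dim=1))


mask_logits = VFVNetSketch()(torch.randn(1, 3, 64, 64))  # -> (1, 1, 64, 64)
```

In this sketch the per-pixel logits would be trained against a binary tampering mask (e.g., with a sigmoid plus binary cross-entropy); the abstract does not specify the loss, so that choice is likewise an assumption.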