Robustifying vision transformer for image forgery localization with multi-exit architectures

IF 7.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pattern Recognition | Pub Date: 2025-03-12 | DOI: 10.1016/j.patcog.2025.111565

Zenan Shi, Haipeng Chen, Dong Zhang

Pattern Recognition, volume 164, Article 111565. Journal Article. https://www.sciencedirect.com/science/article/pii/S0031320325002250

Citations: 0

Abstract

The proliferation of image manipulation tools has led to an increase in the number of manipulated images being disseminated online, posing risks like the propagation of fake news and telecom fraud. Thus, there is an increasing demand for precise, generic, and robust methods for detecting and locating manipulated images. In this paper, we propose a simple and clean model, named MEAFormer, for image forgery localization that does not heavily rely on pre-trained models. MEAFormer comprises three main components: an encoder network, a neck network, and a decoder network. Specifically, the transformer-based encoder network extracts hierarchical feature representations from the input image, providing rich contextual information in each layer. The neck network, incorporating our proposed cross-layer feature aggregation (CFA), aggregates these hierarchical features. To achieve better spatial feature co-occurrence, instead of using noise or edge artifacts, we introduce a multi-scale graph reasoning (MGR) module within the decoder network via bipartite graphs over the encoder and decoder features in a multi-scale fashion. The cross-level enhancement (CLE) further performs adjacent-level feature fusion to amplify the regions of interest in aggregated manipulation features. Finally, the multi-exit architecture (MEA) guides the model to learn fine-grained features and segment out the manipulated region. Extensive experiments across diverse and challenging datasets conclusively establish the superiority of MEAFormer over existing state-of-the-art methods, excelling in accuracy, generalization, and robustness.
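The abstract does not spell out how the multi-exit architecture (MEA) is trained. As a rough illustration only, the sketch below assumes the common reading of "multi-exit": several decoder stages each emit a forgery mask at their own resolution, and every exit is supervised against a resized copy of the ground-truth mask. The function names, the per-exit weights, the nearest-neighbour downsampling, and the binary cross-entropy loss are all assumptions made for this example, not details taken from the paper.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally sized 2-D masks."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def downsample(mask, factor):
    """Nearest-neighbour downsampling: keep every `factor`-th pixel."""
    return [row[::factor] for row in mask[::factor]]


def multi_exit_loss(exit_preds, target, weights):
    """Weighted sum of per-exit losses (hypothetical formulation).

    Each exit prediction is compared with the target mask resized to
    that exit's resolution, so coarse and fine exits are all supervised.
    """
    total = 0.0
    for pred, weight in zip(exit_preds, weights):
        factor = len(target) // len(pred)
        total += weight * bce(pred, downsample(target, factor))
    return total


# Toy 4x4 ground truth: the top-left 2x2 block is manipulated.
target = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
coarse_exit = [[0.9, 0.2], [0.1, 0.1]]           # 2x2 early exit
fine_exit = [[0.9, 0.8, 0.1, 0.1],
             [0.8, 0.9, 0.2, 0.1],
             [0.1, 0.1, 0.1, 0.1],
             [0.1, 0.1, 0.1, 0.1]]               # 4x4 final exit

loss = multi_exit_loss([coarse_exit, fine_exit], target, weights=[0.5, 1.0])
print(f"multi-exit loss: {loss:.4f}")
```

Giving the final, full-resolution exit the largest weight is a typical choice for this kind of deep supervision; whether MEAFormer weights its exits this way is not stated in the abstract.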


Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.