Quan Tang;Chuanjian Liu;Fagui Liu;Jun Jiang;Bowen Zhang;C. L. Philip Chen;Kai Han;Yunhe Wang
Title: Rethinking Feature Reconstruction via Category Prototype in Semantic Segmentation
DOI: 10.1109/TIP.2025.3534532
Journal: IEEE Transactions on Image Processing, vol. 34, pp. 1036-1047
Publication date: 2025-02-03 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10869305/
Citations: 0
Abstract
The encoder-decoder architecture is a prevailing paradigm for semantic segmentation. It has been discovered that aggregating multi-stage encoder features plays a significant role in capturing discriminative pixel representations. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and treat it as a Query Update (Q-UP) task. Pixel-wise affinity scores are calculated between the high-resolution query map and the low-resolution feature map to dynamically broadcast low-resolution pixel features to match the higher resolution. Unlike prior works (e.g., bilinear interpolation) that only exploit sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field in a data-dependent manner. To alleviate intra-category feature variance, we replace the source pixel features used for reconstruction with their corresponding category prototypes, each estimated by averaging all pixel features belonging to that category. In addition, a memory module is proposed to exploit the capacity of category prototypes at the dataset level. We refer to the method as Category Prototype Transformer (CPT). We conduct extensive experiments on popular benchmarks. Integrating CPT into a feature pyramid structure yields superior semantic segmentation performance even with low-resolution feature maps, e.g., 1/32 of the input size, significantly reducing computational complexity. Specifically, the proposed method obtains a compelling 55.5% mIoU with greatly reduced model parameters and computations on the challenging ADE20K dataset.
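The two ideas at the core of the abstract — averaging pixel features into per-category prototypes, then broadcasting those prototypes to high resolution via pixel-wise affinity — can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the paper's CPT implementation: the function names, the softmax-attention form of the affinity, and the 1/sqrt(C) scaling are assumptions, and the actual method additionally uses learned projections and a dataset-level memory module not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def category_prototypes(feats, labels, num_classes):
    """Estimate one prototype per category by averaging all pixel
    features assigned to that category (as described in the abstract).
    feats: (N, C) flattened pixel features; labels: (N,) category ids."""
    protos = np.zeros((num_classes, feats.shape[1]))
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            protos[k] = feats[mask].mean(axis=0)
    return protos

def query_update(query_hr, protos):
    """Hypothetical Q-UP step: compute pixel-wise affinity between the
    high-resolution query map and the prototypes, then broadcast the
    prototype features to high resolution as an affinity-weighted sum.
    query_hr: (M, C) flattened high-res query pixels; protos: (K, C)."""
    scale = np.sqrt(query_hr.shape[1])          # assumed attention scaling
    affinity = softmax(query_hr @ protos.T / scale)
    return affinity @ protos                    # (M, C) reconstructed features

# Usage: 64 low-res pixels over 4 categories, upsampled to 256 query pixels.
rng = np.random.default_rng(0)
feats = rng.random((64, 8))
labels = rng.integers(0, 4, size=64)
protos = category_prototypes(feats, labels, num_classes=4)
recon = query_update(rng.random((256, 8)), protos)
```

Because each reconstructed pixel is a convex combination of category prototypes rather than of its sub-pixel neighbors, every query pixel can draw on a global receptive field, which is the contrast with bilinear interpolation the abstract highlights.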