Channel and Spatial Enhancement Network for human parsing
Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang
Image and Vision Computing, volume 152, Article 105332. Published 2024-11-08.
DOI: 10.1016/j.imavis.2024.105332
URL: https://www.sciencedirect.com/science/article/pii/S0262885624004372
Impact factor: 4.2. JCR: Q2 (Computer Science, Artificial Intelligence). Open access: no.
Citation count: 0
Abstract
The dominant neural network backbones for scene parsing consist of multiple stages, and feature maps from different stages carry different amounts of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features carry fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation without addressing these gaps. This leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing. The problem is especially acute for human parsing, which demands both more semantic information and finer detail in the feature maps because of intricate textures, diverse clothing styles, and heavy scale variability across human parts. In this paper, we alleviate the long-standing problem of semantic-spatial gaps between features from different stages by using subtraction and addition operations to identify the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing. It offers a straightforward and intuitive way to address semantic-spatial gaps: it injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
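The subtract-then-add idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not CSENet itself: the abstract does not specify the modules, so the function names (`upsample_nearest`, `downsample_mean`, `compensate`), the fixed mixing weight `alpha`, the nearest-neighbour and average-pooling resampling, and the assumption that both feature maps share the same channel count are all hypothetical choices made for illustration only.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_mean(x, factor=2):
    """Average pooling of a (C, H, W) array; H and W must be divisible by factor."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def compensate(low, high, alpha=0.5):
    """Illustrative two-way compensation (hypothetical, not the paper's modules).

    low  : (C, 2H, 2W) low-level features (more spatial detail)
    high : (C, H, W)   high-level features (more semantics)
    alpha: hypothetical fixed weight; a real network would learn this step.
    """
    # Subtraction exposes what each level lacks relative to the other.
    semantic_gap = upsample_nearest(high) - low   # semantics missing from low
    detail_gap = downsample_mean(low) - high      # details missing from high
    # Addition compensates each level with the measured difference.
    low_enhanced = low + alpha * semantic_gap
    high_enhanced = high + alpha * detail_gap
    return low_enhanced, high_enhanced

low = np.ones((2, 4, 4))
high = np.zeros((2, 2, 2))
low_enhanced, high_enhanced = compensate(low, high, alpha=0.5)
print(low_enhanced.shape, high_enhanced.shape)
```

With `alpha=0` both inputs are returned unchanged, and with `alpha=1` each level is replaced by the resampled other level; intermediate values blend the two, which is the simplest fixed-weight stand-in for a learned compensation step.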
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.