Channel and Spatial Enhancement Network for human parsing

Image and Vision Computing · IF 4.2 · JCR Q2 (Computer Science, Artificial Intelligence) · CAS Tier 3 (Computer Science) · Pub Date: 2024-11-08 · DOI: 10.1016/j.imavis.2024.105332
Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang
{"title":"Channel and Spatial Enhancement Network for human parsing","authors":"Kunliang Liu ,&nbsp;Rize Jin ,&nbsp;Yuelong Li ,&nbsp;Jianming Wang ,&nbsp;Wonjun Hwang","doi":"10.1016/j.imavis.2024.105332","DOIUrl":null,"url":null,"abstract":"<div><div>The dominant backbones of neural networks for scene parsing consist of multiple stages, where feature maps in different stages often contain varying levels of spatial and semantic information. High-level features convey more semantics and fewer spatial details, while low-level features possess fewer semantics and more spatial details. Consequently, there are semantic-spatial gaps among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation, without addressing the semantic-spatial gaps present among these features. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing, especially for human parsing that demands more semantic information and more fine details of feature maps for the reason of intricate textures, diverse clothing styles, and heavy scale variability across different human parts. In this paper, we effectively alleviate the long-standing challenge of addressing semantic-spatial gaps between features from different stages by innovatively utilizing the subtraction and addition operations to recognize the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, offering a straightforward and intuitive solution for addressing semantic-spatial gaps via injecting high-semantic information to lower-stage features and vice versa, introducing fine details to higher-stage features. Extensive experiments on three dense prediction tasks have demonstrated the efficacy of our method. Specifically, our method achieves the best performance on the LIP and CIHP datasets and we also verify the generality of our method on the ADE20K dataset.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105332"},"PeriodicalIF":4.2000,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004372","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The dominant backbones of neural networks for scene parsing consist of multiple stages, and the feature maps at different stages contain varying levels of spatial and semantic information. High-level features convey more semantics but fewer spatial details, while low-level features carry fewer semantics but more spatial details. Consequently, semantic-spatial gaps exist among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation without addressing these gaps, which inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification. This is especially harmful for human parsing, which demands both rich semantics and fine feature-map detail owing to intricate textures, diverse clothing styles, and large scale variability across human parts. In this paper, we alleviate the long-standing challenge of semantic-spatial gaps between features from different stages by using subtraction to identify the semantic and spatial differences between stages and addition to compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine spatial details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method: it achieves the best performance on the LIP and CIHP datasets, and its generality is verified on the ADE20K dataset.
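The paper itself is not reproduced on this page, but the subtraction/addition principle the abstract describes can be sketched. Below is a minimal, illustrative PyTorch sketch, not the authors' actual CSENet module: the module name, channel widths, and the learned compensation convolutions are all assumptions made for illustration. Subtraction exposes the difference between two stages; addition injects the (transformed) missing component back into each stage.

```python
# Illustrative sketch of cross-stage enhancement via subtraction/addition.
# NOT the authors' CSENet module; names and layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossStageEnhance(nn.Module):
    """Exchange semantics and spatial detail between a lower-stage
    (high-resolution, detail-rich) and a higher-stage (low-resolution,
    semantics-rich) feature map."""

    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        # Project the low-stage map to the high-stage channel width so the
        # two maps can be subtracted and added elementwise.
        self.align_low = nn.Conv2d(low_channels, high_channels, kernel_size=1)
        # Learned compensation: transform each difference map before adding
        # it back, so the result is not a trivial copy of the other stage.
        self.sem_comp = nn.Conv2d(high_channels, high_channels, 3, padding=1)
        self.det_comp = nn.Conv2d(high_channels, high_channels, 3, padding=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor):
        low = self.align_low(low)  # (B, C, H, W)
        # Upsample the high-stage map to the low-stage resolution.
        high_up = F.interpolate(high, size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        # Subtraction exposes what each stage lacks relative to the other.
        sem_gap = high_up - low   # semantics present in high, weak in low
        det_gap = low - high_up   # spatial detail present in low, weak in high
        # Addition compensates each stage with its missing component.
        low_enhanced = low + self.sem_comp(sem_gap)
        high_enhanced = high_up + self.det_comp(det_gap)
        return low_enhanced, high_enhanced

# Usage: e.g. stage-2 features at 1/8 resolution vs. stage-4 at 1/32.
low_feat = torch.randn(1, 128, 64, 64)
high_feat = torch.randn(1, 512, 16, 16)
module = CrossStageEnhance(low_channels=128, high_channels=512)
low_out, high_out = module(low_feat, high_feat)
print(low_out.shape, high_out.shape)  # both torch.Size([1, 512, 64, 64])
```

Note the role of the compensation convolutions: without them, `low + sem_gap` would collapse to exactly `high_up`. Transforming the difference map first lets the network learn how much of each gap to compensate rather than overwriting one stage with the other.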
Source journal

Image and Vision Computing (Engineering & Technology – Engineering: Electronic & Electrical)
CiteScore: 8.50 · Self-citation rate: 8.50% · Articles per year: 143 · Review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.
Latest articles in this journal

- CF-SOLT: Real-time and accurate traffic accident detection using correlation filter-based tracking
- TransWild: Enhancing 3D interacting hands recovery in the wild with IoU-guided Transformer
- Machine learning applications in breast cancer prediction using mammography
- Channel and Spatial Enhancement Network for human parsing
- Non-negative subspace feature representation for few-shot learning in medical imaging