Title: CRENet: Crowd region enhancement network for multi-person 3D pose estimation
Authors: Zhaokun Li, Qiong Liu
DOI: 10.1016/j.imavis.2024.105243
Journal: Image and Vision Computing, Volume 151, Article 105243
Publication date: 2024-08-30
Impact factor: 4.2; JCR Q2 (Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0262885624003482
Citations: 0
Abstract
Recovering multi-person 3D poses from a single image is a challenging problem due to inherent depth ambiguities, including root-relative depth and absolute root depth. Current bottom-up methods show promising potential to mitigate absolute root depth ambiguity by explicitly aggregating global contextual cues. However, these methods treat the entire image region equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, even though each depth relies on different information. This sharing mechanism may result in negative transfer, thus diminishing root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, the Crowd Region Enhancement Network (CRENet), incorporating a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). The FDM explicitly learns a discriminative feature for each depth by adaptively recalibrating its channel-wise responses and fusing multi-level features, which lets the model focus on each depth prediction separately and thus avoids the adverse effect of negative transfer. The GAM highlights crowd regions while suppressing irrelevant regions using an attention mechanism, and further refines the attended regions based on a confidence measure of the attention, which is beneficial for learning depth-related cues from informative crowd regions and facilitates root depth estimation. Comprehensive experiments on the MuPoTS-3D and CMU Panoptic benchmarks demonstrate that our method outperforms state-of-the-art bottom-up methods in absolute 3D pose estimation and is applicable to in-the-wild images, which also indicates that learning depth-specific features and suppressing noise signals can significantly benefit multi-person absolute 3D pose estimation.
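The abstract describes the FDM's channel-wise recalibration only at a high level; the paper's exact formulation is not given here. As an illustration only, such recalibration is commonly implemented in a squeeze-and-excitation style: globally pool each channel, pass the pooled vector through a small bottleneck, and rescale the channels with sigmoid gates. The NumPy sketch below shows that pattern; all names, shapes, and the reduction ratio are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_recalibrate(feat, w1, w2):
    """Squeeze-and-excitation-style channel recalibration (illustrative).

    feat: (C, H, W) feature map
    w1:   (C // r, C) bottleneck weights (squeeze)
    w2:   (C, C // r) expansion weights (excitation)
    """
    squeezed = feat.mean(axis=(1, 2))         # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)   # ReLU bottleneck
    gates = sigmoid(w2 @ hidden)              # per-channel gates in (0, 1)
    return feat * gates[:, None, None]        # rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                       # hypothetical sizes, ratio r
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = channel_recalibrate(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

In a trained network the gate weights would be learned per depth branch, so each branch can emphasize the channels most informative for its own prediction, which is the decoupling effect the FDM is described as achieving.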
Journal overview:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.