CRENet: Crowd region enhancement network for multi-person 3D pose estimation
Authors: Zhaokun Li, Qiong Liu
Journal: Image and Vision Computing, volume 151, Article 105243
Publication date: 2024-08-30
Publication type: Journal Article
DOI: 10.1016/j.imavis.2024.105243
URL: https://www.sciencedirect.com/science/article/pii/S0262885624003482
Impact factor: 4.2
JCR: Q2 (Computer Science, Artificial Intelligence)
Open access: No
Citations: 0
Abstract
Recovering multi-person 3D poses from a single image is challenging because of two inherent depth ambiguities: root-relative depth and absolute root depth. Current bottom-up methods show promise in mitigating absolute root depth ambiguity by explicitly aggregating global contextual cues. However, these methods treat all image regions equally during root depth regression, ignoring the negative impact of irrelevant regions. Moreover, they learn shared features for both depths, even though each depth relies on different information. This sharing may cause negative transfer and thereby reduce root depth prediction accuracy. To address these challenges, we present a novel bottom-up method, the Crowd Region Enhancement Network (CRENet), which incorporates a Feature Decoupling Module (FDM) and a Global Attention Module (GAM). FDM explicitly learns a discriminative feature for each depth by adaptively recalibrating channel-wise responses and fusing multi-level features, so that the model attends to each depth prediction separately and avoids the adverse effect of negative transfer. GAM uses an attention mechanism to highlight crowd regions while suppressing irrelevant ones, and further refines the attended regions using a confidence measure of the attention; this helps the model learn depth-related cues from informative crowd regions and facilitates root depth estimation. Comprehensive experiments on the MuPoTS-3D and CMU Panoptic benchmarks demonstrate that our method outperforms state-of-the-art bottom-up methods in absolute 3D pose estimation and is applicable to in-the-wild images. The results also indicate that learning depth-specific features and suppressing noise signals can significantly benefit multi-person absolute 3D pose estimation.
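The abstract names two mechanisms without giving their details: per-depth channel recalibration (FDM) and attention refined by a confidence measure (GAM). The following is a minimal, dependency-free sketch of the general idea only, not the authors' implementation. Everything in it is an assumption made for illustration: the squeeze-and-gate form of the recalibration, the confidence threshold, the function names, and all numeric values. Multi-level feature fusion is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gates(features, weights, biases):
    """Average each channel to one number, then map the averages to one gate per channel."""
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(weights, biases)]

def recalibrate(features, gates):
    """Scale every value in channel c by gates[c]."""
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]

def refine_attention(attention, confidence, threshold=0.5):
    """Keep attention only where a (hypothetical) confidence map reaches the threshold."""
    return [[a if c >= threshold else 0.0 for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(attention, confidence)]

def apply_attention(features, attention):
    """Multiply every channel element-wise by the same spatial attention map."""
    return [[[a * v for a, v in zip(a_row, row)] for a_row, row in zip(attention, ch)]
            for ch in features]

# Toy shared features: 2 channels on a 2x2 spatial grid.
shared = [[[1.0, 2.0], [3.0, 4.0]],
          [[0.5, 0.5], [0.5, 0.5]]]

# Two independent gate parameter sets (weights, biases), one per depth branch.
# In a trained network these would be learned; the numbers here are made up.
root_params = ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
relative_params = ([[-1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# Decoupling: the same shared features yield a different feature set per depth.
root_feat = recalibrate(shared, channel_gates(shared, *root_params))
relative_feat = recalibrate(shared, channel_gates(shared, *relative_params))

# Attention and confidence maps would come from learned layers; fixed here.
attention = [[0.9, 0.2], [0.8, 0.1]]
confidence = [[0.95, 0.9], [0.3, 0.6]]
refined = refine_attention(attention, confidence)
root_feat_attended = apply_attention(root_feat, refined)
```

In this toy setup the two branches weight the same channels differently, which is the decoupling idea in its simplest form. The position with high attention but low confidence (second row, first column) is zeroed before the root-depth features are used.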
About the journal:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or that applies such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.