SSpose: Self-Supervised Spatial-Aware Model for Human Pose Estimation

Linfang Yu;Zhen Qin;Liqun Xu;Zhiguang Qin;Kim-Kwang Raymond Choo
{"title":"SSpose: Self-Supervised Spatial-Aware Model for Human Pose Estimation","authors":"Linfang Yu;Zhen Qin;Liqun Xu;Zhiguang Qin;Kim-Kwang Raymond Choo","doi":"10.1109/TAI.2024.3440220","DOIUrl":null,"url":null,"abstract":"Human pose estimation (HPE) relies on the anatomical relationships among different body parts to locate keypoints. Despite the significant progress achieved by convolutional neural networks (CNN)-based models in HPE, they typically fail to explicitly learn the global dependencies among various body parts. To overcome this limitation, we propose a spatial-aware HPE model called SSpose that explicitly captures the spatial dependencies between specific key points and different locations in an image. The proposed SSpose model adopts a hybrid CNN-Transformer encoder to simultaneously capture local features and global dependencies. To better preserve image details, a multiscale fusion module is introduced to integrate coarse- and fine-grained image information. By establishing a connection with the activation maximization (AM) principle, the final attention layer of the Transformer aggregates contributions (i.e., attention scores) from all image positions and forms the maximum position in the heatmap, thereby achieving keypoint localization in the head structure. Additionally, to address the issue of visible information leakage in convolutional reconstruction, we have devised a self-supervised training framework for the SSpose model. This framework incorporates mask autoencoder (MAE) technology into SSpose models by utilizing masked convolution and hierarchical masking strategy, thereby facilitating efficient self-supervised learning. Extensive experiments demonstrate that SSpose performs exceptionally well in the pose estimation task. On the COCO val set, it achieves an AP and AR of 77.3% and 82.1%, respectively, while on the COCO test-dev set, the AP and AR are 76.4% and 81.5%. Moreover, the model exhibits strong generalization capabilities on MPII.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"5 11","pages":"5403-5417"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10631686/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Human pose estimation (HPE) relies on the anatomical relationships among different body parts to locate keypoints. Despite the significant progress achieved by convolutional neural network (CNN)-based models in HPE, they typically fail to explicitly learn the global dependencies among body parts. To overcome this limitation, we propose a spatial-aware HPE model called SSpose that explicitly captures the spatial dependencies between specific keypoints and different locations in an image. The proposed SSpose model adopts a hybrid CNN-Transformer encoder to capture local features and global dependencies simultaneously. To better preserve image details, a multiscale fusion module is introduced to integrate coarse- and fine-grained image information. By establishing a connection with the activation maximization (AM) principle, the final attention layer of the Transformer aggregates contributions (i.e., attention scores) from all image positions to form the maximum position in the heatmap, thereby achieving keypoint localization in the head structure. Additionally, to address the issue of visible-information leakage in convolutional reconstruction, we devise a self-supervised training framework for SSpose. This framework incorporates masked autoencoder (MAE) techniques into the SSpose model via masked convolution and a hierarchical masking strategy, thereby enabling efficient self-supervised learning. Extensive experiments demonstrate that SSpose performs exceptionally well on the pose estimation task. On the COCO val set, it achieves an AP and AR of 77.3% and 82.1%, respectively, while on the COCO test-dev set the AP and AR are 76.4% and 81.5%. Moreover, the model exhibits strong generalization capability on MPII.
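The abstract's link between attention and activation maximization can be made concrete with a small sketch. The following PyTorch snippet is a hypothetical illustration, not the authors' released code: the name `AttentionKeypointHead`, the query-per-keypoint design, and the 64x48 grid are assumptions. It shows how attention scores aggregated over all image positions can themselves serve as a keypoint heatmap whose maximum gives the predicted location.

```python
import torch
import torch.nn as nn


class AttentionKeypointHead(nn.Module):
    """Illustrative sketch (assumed design, not the paper's exact head):
    one learnable query per keypoint attends over all image positions,
    and the resulting attention map is used directly as the heatmap."""

    def __init__(self, dim, num_keypoints=17):
        super().__init__()
        # One query vector per keypoint; 17 matches the COCO keypoint set.
        self.queries = nn.Parameter(torch.randn(num_keypoints, dim))
        self.key_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats, hw):
        # feats: (B, H*W, dim) token sequence from a CNN-Transformer encoder.
        B, N, _ = feats.shape
        H, W = hw
        keys = self.key_proj(feats)  # (B, N, dim)
        # Attention scores of each keypoint query against every position.
        attn = torch.einsum("kd,bnd->bkn", self.queries, keys) * self.scale
        heatmaps = attn.softmax(dim=-1).view(B, -1, H, W)  # (B, K, H, W)
        # Keypoint = position of the attention maximum (the AM-principle link).
        flat_idx = heatmaps.flatten(2).argmax(dim=-1)  # (B, K)
        coords = torch.stack((flat_idx % W, flat_idx // W), dim=-1)  # (x, y)
        return heatmaps, coords


# Toy usage: 256-dim tokens on a 64x48 feature grid, 17 keypoints.
head = AttentionKeypointHead(dim=256, num_keypoints=17)
tokens = torch.randn(2, 64 * 48, 256)
heatmaps, coords = head(tokens, hw=(64, 48))
print(heatmaps.shape, coords.shape)  # (2, 17, 64, 48), (2, 17, 2)
```

Under this reading, no separate regression head is needed: the softmaxed attention row for each keypoint already sums contributions from every image position, so taking its argmax performs the localization the abstract attributes to the final attention layer.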