Geometric Consistency-Guaranteed Spatio-Temporal Transformer for Unsupervised Multiview 3-D Pose Estimation

IF 5.9 | Zone 2 (Engineering & Technology) | Q1 (Engineering, Electrical & Electronic) | IEEE Transactions on Instrumentation and Measurement | Pub Date: 2024-09-02 | DOI: 10.1109/TIM.2024.3440376

Kaiwen Dong; Kévin Riou; Jingwen Zhu; Andréas Pastor; Kévin Subrin; Yu Zhou; Xiao Yun; Yanjing Sun; Patrick Le Callet

IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1-12. Journal Article. https://ieeexplore.ieee.org/document/10663570/

Citations: 0

Abstract

Unsupervised 3-D pose estimation has gained prominence due to the challenges in acquiring labeled 3-D data for training. Despite promising progress, unsupervised approaches still lag behind supervised methods in performance. Two factors impede the progress of unsupervised approaches: incomplete geometric constraints and inadequate interaction among spatial, temporal, and multiview features. This article introduces an unsupervised pipeline that uses calibrated camera parameters as geometric constraints across views and coordinate spaces, optimizing the model by minimizing inconsistencies between the 2-D input pose and the reprojection of the predicted 3-D pose. The pipeline uses the novel hierarchical cross transformer (HCT) to encode higher levels of information by enabling interactions among hierarchical features containing different levels of temporal, spatial, and cross-view information. By minimizing the reliance on human-specific parts, the HCT shows potential for adapting to various pose estimation tasks. To validate this adaptability, we build a connection between human pose estimation and scene pose estimation, introducing a dynamic-keypoints-3-D (DKs-3D) dataset tailored for 3-D scene pose estimation in robotic manipulation. Experiments on two 3-D human pose estimation datasets show that our method achieves new state-of-the-art performance among weakly supervised and unsupervised approaches. The adaptability of our method is confirmed through experiments on DKs-3D, setting the initial benchmark for unsupervised 2-D-to-3-D scene pose lifting.
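The reprojection-consistency idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a plain pinhole camera model with known intrinsics `K` and extrinsics `R`, `t` per view, a mean L2 error as the inconsistency measure, and toy camera and pose values chosen only for illustration. The names `project` and `reprojection_loss` are hypothetical.

```python
import numpy as np

def project(points_world, K, R, t):
    """Project (J, 3) world-space keypoints to (J, 2) pixels (pinhole model, assumed)."""
    cam = points_world @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                     # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def reprojection_loss(pose3d_world, poses2d, cameras):
    """Mean L2 distance between each view's 2-D input pose and the reprojected 3-D pose."""
    per_view = []
    for pose2d, (K, R, t) in zip(poses2d, cameras):
        reproj = project(pose3d_world, K, R, t)
        per_view.append(np.linalg.norm(reproj - pose2d, axis=1).mean())
    return float(np.mean(per_view))

# Toy setup (illustrative values only): two calibrated views of three keypoints.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.array([0.0, 0.0, 5.0])
R2 = np.array([[0.0, 0.0, -1.0],        # second view rotated 90 degrees about the y axis
               [0.0, 1.0, 0.0],
               [1.0, 0.0, 0.0]])
t2 = np.array([0.0, 0.0, 5.0])
cameras = [(K, R1, t1), (K, R2, t2)]

pose3d = np.array([[0.0, 0.0, 0.0],
                   [0.1, -0.5, 0.2],
                   [-0.3, 0.4, 0.1]])
poses2d = [project(pose3d, *cam) for cam in cameras]   # stand-in for 2-D detections

loss_exact = reprojection_loss(pose3d, poses2d, cameras)
loss_perturbed = reprojection_loss(pose3d + 0.05, poses2d, cameras)
print(loss_exact, loss_perturbed)
```

In an unsupervised setting, a loss of this kind needs no 3-D labels: a predicted 3-D pose that is consistent with the 2-D inputs in every calibrated view drives the error toward zero, while a perturbed prediction increases it.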

Source journal

IEEE Transactions on Instrumentation and Measurement (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore

9.00

Self-citation rate

23.20%

Articles published

1294

Review time

3.9 months

Journal description: Papers are sought that address innovative solutions to the development and use of electrical and electronic instruments and equipment to measure, monitor and/or record physical phenomena for the purpose of advancing measurement science, methods, functionality and applications. The scope of these papers may encompass: (1) theory, methodology, and practice of measurement; (2) design, development and evaluation of instrumentation and measurement systems and components used in generating, acquiring, conditioning and processing signals; (3) analysis, representation, display, and preservation of the information obtained from a set of measurements; and (4) scientific and technical support to establishment and maintenance of technical standards in the field of Instrumentation and Measurement.