IRPE: Instance-level reconstruction-based 6D pose estimator

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Image and Vision Computing Pub Date : 2025-02-01 DOI:10.1016/j.imavis.2024.105340

Le Jin , Guoshun Zhou , Zherong Liu , Yuanchao Yu , Teng Zhang , Minghui Yang , Jun Zhou

{"title":"IRPE: Instance-level reconstruction-based 6D pose estimator","authors":"Le Jin , Guoshun Zhou , Zherong Liu , Yuanchao Yu , Teng Zhang , Minghui Yang , Jun Zhou","doi":"10.1016/j.imavis.2024.105340","DOIUrl":null,"url":null,"abstract":"<div><div>The estimation of an object’s 6D pose is a fundamental task in modern commercial and industrial applications. Vision-based pose estimation has gained popularity due to its cost-effectiveness and ease of setup in the field. However, this type of estimation tends to be less robust compared to other methods due to its sensitivity to the operating environment. For instance, in robot manipulation applications, heavy occlusion and clutter are common, posing significant challenges. For safety and robustness in industrial environments, depth information is often leveraged instead of relying solely on RGB images. Nevertheless, even with depth information, 6D pose estimation in such scenarios still remains challenging. In this paper, we introduce a novel 6D pose estimation method that promotes the network’s learning of high-level object features through self-supervised learning and instance reconstruction. The feature representation of the reconstructed instance is subsequently utilized in direct 6D pose regression via a multi-task learning scheme. As a result, the proposed method can differentiate and retrieve each object instance from a scene that is heavily occluded and cluttered, thereby surpassing conventional pose estimators in such scenarios. Additionally, due to the standardized prediction of reconstructed image, our estimator exhibits robustness performance against variations in lighting conditions and color drift. This is a significant improvement over traditional methods that depend on pixel-level sparse or dense features. We demonstrate that our method achieves state-of-the-art performance (e.g., 85.4% on LM-O) on the most commonly used benchmarks with respect to the ADD(-S) metric. Lastly, we present a CLIP dataset that emulates intense occlusion scenarios of industrial environment and conduct a real-world experiment for manipulation applications to verify the effectiveness and robustness of our proposed method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105340"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004451","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

The estimation of an object’s 6D pose is a fundamental task in modern commercial and industrial applications. Vision-based pose estimation has gained popularity due to its cost-effectiveness and ease of setup in the field. However, this type of estimation tends to be less robust compared to other methods due to its sensitivity to the operating environment. For instance, in robot manipulation applications, heavy occlusion and clutter are common, posing significant challenges. For safety and robustness in industrial environments, depth information is often leveraged instead of relying solely on RGB images. Nevertheless, even with depth information, 6D pose estimation in such scenarios still remains challenging. In this paper, we introduce a novel 6D pose estimation method that promotes the network’s learning of high-level object features through self-supervised learning and instance reconstruction. The feature representation of the reconstructed instance is subsequently utilized in direct 6D pose regression via a multi-task learning scheme. As a result, the proposed method can differentiate and retrieve each object instance from a scene that is heavily occluded and cluttered, thereby surpassing conventional pose estimators in such scenarios. Additionally, due to the standardized prediction of reconstructed image, our estimator exhibits robustness performance against variations in lighting conditions and color drift. This is a significant improvement over traditional methods that depend on pixel-level sparse or dense features. We demonstrate that our method achieves state-of-the-art performance (e.g., 85.4% on LM-O) on the most commonly used benchmarks with respect to the ADD(-S) metric. Lastly, we present a CLIP dataset that emulates intense occlusion scenarios of industrial environment and conduct a real-world experiment for manipulation applications to verify the effectiveness and robustness of our proposed method.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

IRPE：基于实例级重建的6D姿态估计器

在现代商业和工业应用中，物体的6D姿态估计是一项基本任务。基于视觉的姿态估计由于其成本效益和易于设置而在该领域获得了广泛的应用。然而，由于对操作环境的敏感性，与其他方法相比，这种估计的鲁棒性往往较差。例如，在机器人操作应用中，严重的遮挡和杂乱是常见的，构成了重大的挑战。为了工业环境中的安全性和鲁棒性，通常利用深度信息而不是仅仅依赖于RGB图像。然而，即使有深度信息，在这种情况下的6D姿态估计仍然具有挑战性。在本文中，我们引入了一种新的6D姿态估计方法，该方法通过自监督学习和实例重建来促进网络对高级目标特征的学习。重构实例的特征表示随后通过多任务学习方案用于直接6D姿态回归。因此，该方法可以从严重遮挡和混乱的场景中区分和检索每个对象实例，从而在此类场景中优于传统的姿态估计器。此外，由于重建图像的标准化预测，我们的估计器对光照条件和颜色漂移的变化表现出鲁棒性。这是对依赖于像素级稀疏或密集特征的传统方法的重大改进。我们证明了我们的方法在与ADD（-S）度量相关的最常用基准上达到了最先进的性能（例如，在LM-O上达到了85.4%）。最后，我们提出了一个CLIP数据集，该数据集模拟了工业环境的强烈遮挡场景，并对操作应用进行了真实世界的实验，以验证我们提出的方法的有效性和鲁棒性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.