The estimation of an object’s 6D pose is a fundamental task in modern commercial and industrial applications. Vision-based pose estimation has gained popularity due to its cost-effectiveness and ease of setup in the field. However, it tends to be less robust than other methods because of its sensitivity to the operating environment. For instance, in robot manipulation applications, heavy occlusion and clutter are common and pose significant challenges. For safety and robustness in industrial environments, depth information is often leveraged instead of relying solely on RGB images. Nevertheless, even with depth information, 6D pose estimation in such scenarios remains challenging. In this paper, we introduce a novel 6D pose estimation method that promotes the network’s learning of high-level object features through self-supervised learning and instance reconstruction. The feature representation of the reconstructed instance is subsequently utilized for direct 6D pose regression via a multi-task learning scheme. As a result, the proposed method can differentiate and retrieve each object instance from a heavily occluded and cluttered scene, thereby surpassing conventional pose estimators in such scenarios. Additionally, owing to the standardized prediction of the reconstructed image, our estimator exhibits robust performance against variations in lighting conditions and color drift, a significant improvement over traditional methods that depend on pixel-level sparse or dense features. We demonstrate that our method achieves state-of-the-art performance (e.g., 85.4% on LM-O) on the most commonly used benchmarks with respect to the ADD(-S) metric. Lastly, we present a CLIP dataset that emulates the intense occlusion scenarios of industrial environments and conduct a real-world manipulation experiment to verify the effectiveness and robustness of our proposed method.
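
For reference, the ADD and ADD-S metrics mentioned above follow the standard definitions introduced by Hinterstoisser et al.; the sketch below restates them with assumed notation (not taken from this paper), where $\mathcal{M}$ is the set of 3D model points, $(R, t)$ the ground-truth rotation and translation, and $(\hat{R}, \hat{t})$ the predicted pose.

% Standard ADD / ADD-S definitions (notation assumed for illustration):
% ADD averages the distance between corresponding transformed model points;
% ADD-S matches each point to its closest transformed counterpart, which
% makes the metric invariant to object symmetries.
\begin{align}
  \mathrm{ADD} &= \frac{1}{|\mathcal{M}|} \sum_{x \in \mathcal{M}}
    \bigl\| (R x + t) - (\hat{R} x + \hat{t}) \bigr\|_2, \\
  \mathrm{ADD\text{-}S} &= \frac{1}{|\mathcal{M}|} \sum_{x_1 \in \mathcal{M}}
    \min_{x_2 \in \mathcal{M}}
    \bigl\| (R x_1 + t) - (\hat{R} x_2 + \hat{t}) \bigr\|_2 .
\end{align}

Under the common evaluation protocol, a predicted pose is counted as correct when this average distance falls below 10% of the object’s diameter.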