Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation.

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen
IEEE Transactions on Pattern Analysis and Machine Intelligence. Published 2024-08-16. DOI: 10.1109/TPAMI.2024.3444912 (https://doi.org/10.1109/TPAMI.2024.3444912)

Abstract

We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method currently ranks 1st on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant depth, and surface normal prediction, as shown in Fig. 1. Notably, we surpassed the recent MarigoldDepth and DepthAnything on various depth benchmarks including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology.
The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
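
The abstract names a canonical camera space transformation but does not spell it out. The sketch below illustrates the underlying pinhole-camera idea only, and is not the authors' implementation. An object of physical size S at depth d projects to roughly f·S/d pixels, so an image taken with focal length f looks the same as one taken with a canonical focal length f_c if every depth is rescaled by f_c/f. The function names and the value of `CANONICAL_FOCAL` are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only; names and the canonical focal length are assumptions.
CANONICAL_FOCAL = 1000.0  # assumed canonical focal length, in pixels


def to_canonical(depths_m, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Rescale metric depths (metres) as if the image came from the canonical camera."""
    scale = canonical_focal / focal_px
    return [d * scale for d in depths_m]


def from_canonical(depths_canonical, focal_px, canonical_focal=CANONICAL_FOCAL):
    """Invert the rescaling to recover metric depths for the real camera."""
    scale = focal_px / canonical_focal
    return [d * scale for d in depths_canonical]


# Depths observed by a camera whose focal length is 500 px.
depths = [2.0, 5.0, 10.0]
canon = to_canonical(depths, focal_px=500.0)
restored = from_canonical(canon, focal_px=500.0)
```

Under these assumptions, a network trained on the rescaled depths sees one consistent camera, and the real focal length is needed only to map predictions back to metres.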
