ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan
{"title":"ReconX:利用视频扩散模型从稀疏视图重建任何场景","authors":"Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan","doi":"arxiv-2408.16767","DOIUrl":null,"url":null,"abstract":"Advancements in 3D scene reconstruction have transformed 2D images from the\nreal world into 3D models, producing realistic 3D results from hundreds of\ninput photos. Despite great success in dense-view reconstruction scenarios,\nrendering a detailed scene from insufficient captured views is still an\nill-posed optimization problem, often resulting in artifacts and distortions in\nunseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction\nparadigm that reframes the ambiguous reconstruction challenge as a temporal\ngeneration task. The key insight is to unleash the strong generative prior of\nlarge pre-trained video diffusion models for sparse-view reconstruction.\nHowever, 3D view consistency struggles to be accurately preserved in directly\ngenerated video frames from pre-trained models. To address this, given limited\ninput views, the proposed ReconX first constructs a global point cloud and\nencodes it into a contextual space as the 3D structure condition. Guided by the\ncondition, the video diffusion model then synthesizes video frames that are\nboth detail-preserved and exhibit a high degree of 3D consistency, ensuring the\ncoherence of the scene from various perspectives. Finally, we recover the 3D\nscene from the generated video through a confidence-aware 3D Gaussian Splatting\noptimization scheme. Extensive experiments on various real-world datasets show\nthe superiority of our ReconX over state-of-the-art methods in terms of quality\nand generalizability.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model\",\"authors\":\"Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan\",\"doi\":\"arxiv-2408.16767\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Advancements in 3D scene reconstruction have transformed 2D images from the\\nreal world into 3D models, producing realistic 3D results from hundreds of\\ninput photos. Despite great success in dense-view reconstruction scenarios,\\nrendering a detailed scene from insufficient captured views is still an\\nill-posed optimization problem, often resulting in artifacts and distortions in\\nunseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction\\nparadigm that reframes the ambiguous reconstruction challenge as a temporal\\ngeneration task. The key insight is to unleash the strong generative prior of\\nlarge pre-trained video diffusion models for sparse-view reconstruction.\\nHowever, 3D view consistency struggles to be accurately preserved in directly\\ngenerated video frames from pre-trained models. To address this, given limited\\ninput views, the proposed ReconX first constructs a global point cloud and\\nencodes it into a contextual space as the 3D structure condition. Guided by the\\ncondition, the video diffusion model then synthesizes video frames that are\\nboth detail-preserved and exhibit a high degree of 3D consistency, ensuring the\\ncoherence of the scene from various perspectives. 
Finally, we recover the 3D\\nscene from the generated video through a confidence-aware 3D Gaussian Splatting\\noptimization scheme. Extensive experiments on various real-world datasets show\\nthe superiority of our ReconX over state-of-the-art methods in terms of quality\\nand generalizability.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.16767\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.16767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
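The abstract outlines a four-stage pipeline: build a global point cloud from the sparse input views, encode it into a contextual space as a 3D structure condition, let the conditioned video diffusion model generate 3D-consistent frames, and finally optimize a 3D Gaussian Splatting representation with confidence weighting. The sketch below is only an illustration of that flow, not the authors' implementation; every function body, shape, and the confidence heuristic are placeholder assumptions kept minimal so the script runs with NumPy alone.

```python
# Hypothetical sketch of a ReconX-style pipeline as described in the abstract.
# All function names, shapes, and heuristics are illustrative placeholders.
import numpy as np


def build_global_point_cloud(views: list[np.ndarray]) -> np.ndarray:
    """Placeholder: fuse sparse input views into a global point cloud.
    In practice this would come from a multi-view reconstruction model."""
    rng = np.random.default_rng(0)
    return np.concatenate([rng.normal(size=(1024, 3)) for _ in views], axis=0)


def encode_structure_condition(points: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder: project the point cloud into contextual tokens that act as
    the '3D structure condition' guiding the video diffusion model."""
    rng = np.random.default_rng(1)
    proj = rng.normal(size=(points.shape[1], dim))
    return points @ proj  # (N, dim) context tokens


def video_diffusion_sample(condition: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Placeholder: a pre-trained video diffusion model would sample frames
    conditioned on the structure tokens; here we return dummy frames."""
    rng = np.random.default_rng(2)
    return rng.uniform(size=(num_frames, 64, 64, 3))


def confidence_aware_3dgs(frames: np.ndarray) -> dict:
    """Placeholder: optimize 3D Gaussians against the generated frames while
    down-weighting low-confidence frames (crude per-frame variance proxy)."""
    confidence = frames.var(axis=(1, 2, 3))
    weights = confidence / confidence.sum()
    return {"num_gaussians": 100_000, "frame_weights": weights}


if __name__ == "__main__":
    sparse_views = [np.zeros((64, 64, 3)) for _ in range(2)]  # e.g. two input images
    cloud = build_global_point_cloud(sparse_views)
    cond = encode_structure_condition(cloud)
    frames = video_diffusion_sample(cond)
    scene = confidence_aware_3dgs(frames)
    print(scene["num_gaussians"], scene["frame_weights"].shape)
```

The only substantive design point the sketch tries to mirror is the ordering: generation is conditioned on explicit 3D structure before any frames are synthesized, and the final reconstruction trusts generated frames unevenly rather than uniformly.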