{"title":"StereoCrafter:从单目视频生成基于扩散的长尺寸高保真立体三维图像","authors":"Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan","doi":"arxiv-2409.07447","DOIUrl":null,"url":null,"abstract":"This paper presents a novel framework for converting 2D videos to immersive\nstereoscopic 3D, addressing the growing demand for 3D content in immersive\nexperience. Leveraging foundation models as priors, our approach overcomes the\nlimitations of traditional methods and boosts the performance to ensure the\nhigh-fidelity generation required by the display devices. The proposed system\nconsists of two main steps: depth-based video splatting for warping and\nextracting occlusion mask, and stereo video inpainting. We utilize pre-trained\nstable video diffusion as the backbone and introduce a fine-tuning protocol for\nthe stereo video inpainting task. To handle input video with varying lengths\nand resolutions, we explore auto-regressive strategies and tiled processing.\nFinally, a sophisticated data processing pipeline has been developed to\nreconstruct a large-scale and high-quality dataset to support our training. Our\nframework demonstrates significant improvements in 2D-to-3D video conversion,\noffering a practical solution for creating immersive content for 3D devices\nlike Apple Vision Pro and 3D displays. In summary, this work contributes to the\nfield by presenting an effective method for generating high-quality\nstereoscopic videos from monocular input, potentially transforming how we\nexperience digital media.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos\",\"authors\":\"Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan\",\"doi\":\"arxiv-2409.07447\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a novel framework for converting 2D videos to immersive\\nstereoscopic 3D, addressing the growing demand for 3D content in immersive\\nexperience. Leveraging foundation models as priors, our approach overcomes the\\nlimitations of traditional methods and boosts the performance to ensure the\\nhigh-fidelity generation required by the display devices. The proposed system\\nconsists of two main steps: depth-based video splatting for warping and\\nextracting occlusion mask, and stereo video inpainting. We utilize pre-trained\\nstable video diffusion as the backbone and introduce a fine-tuning protocol for\\nthe stereo video inpainting task. To handle input video with varying lengths\\nand resolutions, we explore auto-regressive strategies and tiled processing.\\nFinally, a sophisticated data processing pipeline has been developed to\\nreconstruct a large-scale and high-quality dataset to support our training. Our\\nframework demonstrates significant improvements in 2D-to-3D video conversion,\\noffering a practical solution for creating immersive content for 3D devices\\nlike Apple Vision Pro and 3D displays. 
In summary, this work contributes to the\\nfield by presenting an effective method for generating high-quality\\nstereoscopic videos from monocular input, potentially transforming how we\\nexperience digital media.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07447\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper presents a novel framework for converting 2D videos into immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts performance to ensure the high-fidelity generation required by display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting occlusion masks, and stereo video inpainting. We utilize pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos of varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale, high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices such as the Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
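
To make the first stage concrete, below is a minimal sketch of depth-based forward splatting for a single frame, assuming a per-pixel disparity map obtained from a monocular depth estimator: the left view is warped toward the right eye, and pixels that receive no source contribution form the occlusion mask that the stereo video inpainting stage must fill. The function name splat_right_view and the crude z-buffer handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def splat_right_view(left: np.ndarray, disparity: np.ndarray):
    """Forward-splat a left-eye frame to the right eye.

    left:      (H, W, 3) image.
    disparity: (H, W) horizontal disparity in pixels (larger = closer).
    Returns (right_view, occlusion_mask); the mask is 1 where no source
    pixel landed, i.e. the disoccluded regions to be inpainted.
    """
    h, w = disparity.shape
    right = np.zeros_like(left, dtype=np.float32)
    hit = np.zeros((h, w), dtype=bool)

    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xt = np.rint(xs - disparity).astype(int)       # right-eye x = left-eye x - disparity
    valid = (xt >= 0) & (xt < w)

    # Crude z-buffer: write far pixels first so nearer ones (larger disparity)
    # overwrite them when several sources map to the same target pixel.
    order = np.argsort(disparity[valid], kind="stable")
    sy, sx, dx = ys[valid][order], xs[valid][order], xt[valid][order]
    right[sy, dx] = left[sy, sx]
    hit[sy, dx] = True

    occlusion_mask = (~hit).astype(np.float32)
    return right, occlusion_mask
```

In the full system described above, the warped view and mask would be computed per frame and passed, together with the original left view, to the fine-tuned video inpainting model.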
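The abstract also mentions auto-regressive processing for inputs of arbitrary length. A minimal sketch of that idea, assuming a hypothetical inpaint_chunk(frames, context) wrapper around the fine-tuned inpainting model, is to slide over the video in overlapping chunks and condition each chunk on the tail of the previous output; the chunk length and overlap below are illustrative values, not taken from the paper.

```python
from typing import Callable, List
import numpy as np

Frames = List[np.ndarray]

def inpaint_long_video(
    frames: Frames,
    inpaint_chunk: Callable[[Frames, Frames], Frames],  # hypothetical model wrapper
    chunk_len: int = 16,
    overlap: int = 4,
) -> Frames:
    """Process a long video in overlapping chunks, auto-regressively reusing
    the tail of each chunk's output as conditioning for the next chunk."""
    step = chunk_len - overlap
    out: Frames = []
    prev_tail: Frames = []
    for start in range(0, len(frames), step):
        chunk = frames[start : start + chunk_len]
        result = inpaint_chunk(chunk, prev_tail)
        # The first `overlap` frames of later chunks were already emitted.
        out.extend(result if start == 0 else result[overlap:])
        prev_tail = result[-overlap:]
        if start + chunk_len >= len(frames):
            break
    return out
```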