Capturing and rendering novel views of complex real-world scenes is a long-standing problem in computer graphics and vision, with applications in augmented and virtual reality, immersive experiences and 3D photography. The advent of deep learning has enabled revolutionary advances in this area, classically known as image-based rendering. However, previous approaches require intractably dense view sampling or provide little or no guidance for how users should sample views of a scene to reliably render high-quality novel views. Local light field fusion proposes an algorithm for practical view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image scene representation, then renders novel views by blending adjacent local light fields. Crucially, we extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. We achieve the perceptual quality of Nyquist rate view sampling while using up to 4000x fewer views. Subsequent developments have led to new scene representations for deep learning with view synthesis, notably neural radiance fields, but the problem of sparse view synthesis from a small number of images has only grown in importance. We reprise some of the recent results on sparse and even single image view synthesis, while posing the question of whether prescriptive sampling guidelines are feasible for the new generation of image-based rendering algorithms.
{"title":"Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond","authors":"Ravi Ramamoorthi","doi":"arxiv-2408.04586","DOIUrl":"https://doi.org/arxiv-2408.04586","url":null,"abstract":"Capturing and rendering novel views of complex real-world scenes is a\u0000long-standing problem in computer graphics and vision, with applications in\u0000augmented and virtual reality, immersive experiences and 3D photography. The\u0000advent of deep learning has enabled revolutionary advances in this area,\u0000classically known as image-based rendering. However, previous approaches\u0000require intractably dense view sampling or provide little or no guidance for\u0000how users should sample views of a scene to reliably render high-quality novel\u0000views. Local light field fusion proposes an algorithm for practical view\u0000synthesis from an irregular grid of sampled views that first expands each\u0000sampled view into a local light field via a multiplane image scene\u0000representation, then renders novel views by blending adjacent local light\u0000fields. Crucially, we extend traditional plenoptic sampling theory to derive a\u0000bound that specifies precisely how densely users should sample views of a given\u0000scene when using our algorithm. We achieve the perceptual quality of Nyquist\u0000rate view sampling while using up to 4000x fewer views. Subsequent developments\u0000have led to new scene representations for deep learning with view synthesis,\u0000notably neural radiance fields, but the problem of sparse view synthesis from a\u0000small number of images has only grown in importance. We reprise some of the\u0000recent results on sparse and even single image view synthesis, while posing the\u0000question of whether prescriptive sampling guidelines are feasible for the new\u0000generation of image-based rendering algorithms.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Yang Li, Pan Ji, Hongdong Li
3D content generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning based approach for automatically generating interactive and playable 3D game scenes, all from the user's casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural and convenient way to convey the user's design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e., the lack of large-scale training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as one built on a 3D game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user's intention.
{"title":"Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches","authors":"Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Yang Li, Pan Ji, Hongdong Li","doi":"arxiv-2408.04567","DOIUrl":"https://doi.org/arxiv-2408.04567","url":null,"abstract":"3D Content Generation is at the heart of many computer graphics applications,\u0000including video gaming, film-making, virtual and augmented reality, etc. This\u0000paper proposes a novel deep-learning based approach for automatically\u0000generating interactive and playable 3D game scenes, all from the user's casual\u0000prompts such as a hand-drawn sketch. Sketch-based input offers a natural, and\u0000convenient way to convey the user's design intention in the content creation\u0000process. To circumvent the data-deficient challenge in learning (i.e. the lack\u0000of large training data of 3D scenes), our method leverages a pre-trained 2D\u0000denoising diffusion model to generate a 2D image of the scene as the conceptual\u0000guidance. In this process, we adopt the isometric projection mode to factor out\u0000unknown camera poses while obtaining the scene layout. From the generated\u0000isometric image, we use a pre-trained image understanding method to segment the\u0000image into meaningful parts, such as off-ground objects, trees, and buildings,\u0000and extract the 2D scene layout. These segments and layouts are subsequently\u0000fed into a procedural content generation (PCG) engine, such as a 3D video game\u0000engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can\u0000be seamlessly integrated into a game development environment and is readily\u0000playable. Extensive tests demonstrate that our method can efficiently generate\u0000high-quality and interactive 3D game scenes with layouts that closely follow\u0000the user's intention.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The generalized winding number is an essential part of the geometry processing toolkit, allowing one to quantify how much a given point is inside a surface, often represented by a mesh or a point cloud, even when the surface is open, noisy, or non-manifold. Parameterized surfaces, which often contain intentional and unintentional gaps and imprecisions, would also benefit from a generalized winding number. Standard methods to compute it, however, rely on a surface integral, which is challenging to compute without discretizing the surface and thus sacrifices the precision that is characteristic of parametric surfaces. We propose an alternative method to compute a generalized winding number, based only on the surface boundary and the intersections of a single ray with the surface. For parametric surfaces, we show that all the necessary operations can be done via a Sum-of-Squares (SOS) formulation, thus computing generalized winding numbers without surface discretization, to machine precision. We show that by discretizing only the boundary of the surface, this becomes an efficient method. We demonstrate an application of our method to the problem of computing a generalized winding number of a surface represented by a curve network, where each curve loop is surfaced via the Laplace equation. We use the Boundary Element Method to express the solution as a parametric surface, allowing us to apply our method without meshing the surfaces. As a bonus, we also demonstrate that for meshes with many triangles and a simple boundary, our method is faster than hierarchical evaluation of the generalized winding number while still being precise. We validate our algorithms theoretically, numerically, and by demonstrating a gallery of results on a variety of parametric surfaces and meshes, as well as uses in a variety of applications, including voxelizations and boolean operations.
{"title":"One-Shot Method for Computing Generalized Winding Numbers","authors":"Cedric Martens, Mikhail Bessmeltsev","doi":"arxiv-2408.04466","DOIUrl":"https://doi.org/arxiv-2408.04466","url":null,"abstract":"The generalized winding number is an essential part of the geometry\u0000processing toolkit, allowing to quantify how much a given point is inside a\u0000surface, often represented by a mesh or a point cloud, even when the surface is\u0000open, noisy, or non-manifold. Parameterized surfaces, which often contain\u0000intentional and unintentional gaps and imprecisions, would also benefit from a\u0000generalized winding number. Standard methods to compute it, however, rely on a\u0000surface integral, challenging to compute without surface discretization,\u0000leading to loss of precision characteristic of parametric surfaces. We propose an alternative method to compute a generalized winding number,\u0000based only on the surface boundary and the intersections of a single ray with\u0000the surface. For parametric surfaces, we show that all the necessary operations\u0000can be done via a Sum-of-Squares (SOS) formulation, thus computing generalized\u0000winding numbers without surface discretization with machine precision. We show\u0000that by discretizing only the boundary of the surface, this becomes an\u0000efficient method. We demonstrate an application of our method to the problem of computing a\u0000generalized winding number of a surface represented by a curve network, where\u0000each curve loop is surfaced via Laplace equation. We use the Boundary Element\u0000Method to express the solution as a parametric surface, allowing us to apply\u0000our method without meshing the surfaces. As a bonus, we also demonstrate that\u0000for meshes with many triangles and a simple boundary, our method is faster than\u0000the hierarchical evaluation of the generalized winding number while still being\u0000precise. We validate our algorithms theoretically, numerically, and by demonstrating a\u0000gallery of results new{on a variety of parametric surfaces and meshes}, as\u0000well uses in a variety of applications, including voxelizations and boolean\u0000operations.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongcheng Song, Dmitry Kachkovski, Shaimaa Monem, Abraham Kassauhun Negash, David I. W. Levin
In this work, we show that exploiting additional variables in a mixed finite element formulation of deformation leads to an efficient physics-based character skinning algorithm. Taking a user-defined rig as input, we show how to efficiently compute deformations of the character mesh that respect artist-supplied handle positions and orientations, without requiring complicated constraints on the physics solver, which can cause poor performance. Rather, we demonstrate an efficient, user-controllable skinning pipeline that can generate compelling character deformations using a variety of physics material models.
{"title":"Automatic Skinning using the Mixed Finite Element Method","authors":"Hongcheng Song, Dmitry Kachkovski, Shaimaa Monem, Abraham Kassauhun Negash, David I. W. Levin","doi":"arxiv-2408.04066","DOIUrl":"https://doi.org/arxiv-2408.04066","url":null,"abstract":"In this work, we show that exploiting additional variables in a mixed finite\u0000element formulation of deformation leads to an efficient physics-based\u0000character skinning algorithm. Taking as input, a user-defined rig, we show how\u0000to efficiently compute deformations of the character mesh which respect\u0000artist-supplied handle positions and orientations, but without requiring\u0000complicated constraints on the physics solver, which can cause poor\u0000performance. Rather we demonstrate an efficient, user controllable skinning\u0000pipeline that can generate compelling character deformations, using a variety\u0000of physics material models.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an approach to decomposing animated graphics into sprites, a set of basic elements or layers. Our approach builds on the optimization of sprite parameters to fit the raster video. For efficiency, we assume static textures for sprites to reduce the search space, while preventing artifacts using a texture prior model. To further speed up the optimization, we initialize the sprite parameters using a pre-trained video object segmentation model and user-provided single-frame annotations. For our study, we construct the Crello Animation dataset from an online design service and define quantitative metrics to measure the quality of the extracted sprites. Experiments show that our method significantly outperforms baselines for similar decomposition tasks in terms of the quality/efficiency trade-off.
{"title":"Fast Sprite Decomposition from Animated Graphics","authors":"Tomoyuki Suzuki, Kotaro Kikuchi, Kota Yamaguchi","doi":"arxiv-2408.03923","DOIUrl":"https://doi.org/arxiv-2408.03923","url":null,"abstract":"This paper presents an approach to decomposing animated graphics into\u0000sprites, a set of basic elements or layers. Our approach builds on the\u0000optimization of sprite parameters to fit the raster video. For efficiency, we\u0000assume static textures for sprites to reduce the search space while preventing\u0000artifacts using a texture prior model. To further speed up the optimization, we\u0000introduce the initialization of the sprite parameters utilizing a pre-trained\u0000video object segmentation model and user input of single frame annotations. For\u0000our study, we construct the Crello Animation dataset from an online design\u0000service and define quantitative metrics to measure the quality of the extracted\u0000sprites. Experiments show that our method significantly outperforms baselines\u0000for similar decomposition tasks in terms of the quality/efficiency tradeoff.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differentiable volumetric rendering-based methods have made significant progress in novel view synthesis. On one hand, innovative methods have replaced the Neural Radiance Fields (NeRF) network with locally parameterized structures, enabling high-quality renderings in a reasonable time. On the other hand, approaches have used differentiable splatting instead of NeRF's ray casting to optimize radiance fields rapidly using Gaussian kernels, allowing for fine adaptation to the scene. However, differentiable ray casting of irregularly spaced kernels has been scarcely explored, while splatting, despite enabling fast rendering times, is susceptible to clearly visible artifacts. Our work closes this gap by providing a physically consistent formulation of the emitted radiance c and density σ, decomposed with Gaussian functions associated with Spherical Gaussians/Harmonics for an all-frequency colorimetric representation. We also introduce a method enabling differentiable ray casting of irregularly distributed Gaussians, using an algorithm that integrates radiance fields slab by slab and leverages a BVH structure. This allows our approach to adapt finely to the scene while avoiding splatting artifacts. As a result, we achieve superior rendering quality compared to the state of the art while maintaining reasonable training times and achieving inference speeds of 25 FPS on the Blender dataset. Project page with videos and code: https://raygauss.github.io/
{"title":"RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis","authors":"Hugo Blanc, Jean-Emmanuel Deschaud, Alexis Paljic","doi":"arxiv-2408.03356","DOIUrl":"https://doi.org/arxiv-2408.03356","url":null,"abstract":"Differentiable volumetric rendering-based methods made significant progress\u0000in novel view synthesis. On one hand, innovative methods have replaced the\u0000Neural Radiance Fields (NeRF) network with locally parameterized structures,\u0000enabling high-quality renderings in a reasonable time. On the other hand,\u0000approaches have used differentiable splatting instead of NeRF's ray casting to\u0000optimize radiance fields rapidly using Gaussian kernels, allowing for fine\u0000adaptation to the scene. However, differentiable ray casting of irregularly\u0000spaced kernels has been scarcely explored, while splatting, despite enabling\u0000fast rendering times, is susceptible to clearly visible artifacts. Our work closes this gap by providing a physically consistent formulation of\u0000the emitted radiance c and density {sigma}, decomposed with Gaussian functions\u0000associated with Spherical Gaussians/Harmonics for all-frequency colorimetric\u0000representation. We also introduce a method enabling differentiable ray casting\u0000of irregularly distributed Gaussians using an algorithm that integrates\u0000radiance fields slab by slab and leverages a BVH structure. This allows our\u0000approach to finely adapt to the scene while avoiding splatting artifacts. As a\u0000result, we achieve superior rendering quality compared to the state-of-the-art\u0000while maintaining reasonable training times and achieving inference speeds of\u000025 FPS on the Blender dataset. Project page with videos and code:\u0000https://raygauss.github.io/","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a new approach for generating realistic 3D models with UV maps through a representation termed "Object Images." This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.
{"title":"An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion","authors":"Xingguang Yan, Han-Hung Lee, Ziyu Wan, Angel X. Chang","doi":"arxiv-2408.03178","DOIUrl":"https://doi.org/arxiv-2408.03178","url":null,"abstract":"We introduce a new approach for generating realistic 3D models with UV maps\u0000through a representation termed \"Object Images.\" This approach encapsulates\u0000surface geometry, appearance, and patch structures within a 64x64 pixel image,\u0000effectively converting complex 3D shapes into a more manageable 2D format. By\u0000doing so, we address the challenges of both geometric and semantic irregularity\u0000inherent in polygonal meshes. This method allows us to use image generation\u0000models, such as Diffusion Transformers, directly for 3D shape generation.\u0000Evaluated on the ABO dataset, our generated shapes with patch structures\u0000achieve point cloud FID comparable to recent 3D generative models, while\u0000naturally supporting PBR material generation.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tengfei Wang, Zongqian Zhan, Rui Xia, Linxia Ji, Xin Wang
Over the last few decades, image-based building surface reconstruction has garnered substantial research interest and has been applied across various fields, such as heritage preservation, architectural planning, etc. Compared to traditional photogrammetric and NeRF-based solutions, Gaussian fields-based methods have recently exhibited significant potential in generating surface meshes, owing to their time-efficient training and detailed preservation of 3D information. However, most Gaussian fields-based methods are trained with all image pixels, encompassing building and non-building areas, which results in significant noise in the building meshes and degraded time efficiency. This paper proposes a novel framework, Masked Gaussian Fields (MGFs), designed to generate accurate surface reconstructions of buildings in a time-efficient way. The framework first applies EfficientSAM and COLMAP to generate multi-level masks of the building and the corresponding masked point clouds. Subsequently, the masked Gaussian fields are trained by integrating two innovative losses: a multi-level perceptual masked loss focused on constructing building regions, and a boundary loss aimed at enhancing the details of the boundaries between different masks. Finally, we improve the tetrahedral surface mesh extraction method based on the masked Gaussian spheres. Comprehensive experiments on UAV images demonstrate that, compared to a traditional method and several NeRF-based and Gaussian-based SOTA solutions, our approach significantly improves both the accuracy and efficiency of building surface reconstruction. Notably, as a byproduct, there is an additional gain in novel view synthesis of the building.
{"title":"MGFs: Masked Gaussian Fields for Meshing Building based on Multi-View Images","authors":"Tengfei Wang, Zongqian Zhan, Rui Xia, Linxia Ji, Xin Wang","doi":"arxiv-2408.03060","DOIUrl":"https://doi.org/arxiv-2408.03060","url":null,"abstract":"Over the last few decades, image-based building surface reconstruction has\u0000garnered substantial research interest and has been applied across various\u0000fields, such as heritage preservation, architectural planning, etc. Compared to\u0000the traditional photogrammetric and NeRF-based solutions, recently, Gaussian\u0000fields-based methods have exhibited significant potential in generating surface\u0000meshes due to their time-efficient training and detailed 3D information\u0000preservation. However, most gaussian fields-based methods are trained with all\u0000image pixels, encompassing building and nonbuilding areas, which results in a\u0000significant noise for building meshes and degeneration in time efficiency. This\u0000paper proposes a novel framework, Masked Gaussian Fields (MGFs), designed to\u0000generate accurate surface reconstruction for building in a time-efficient way.\u0000The framework first applies EfficientSAM and COLMAP to generate multi-level\u0000masks of building and the corresponding masked point clouds. Subsequently, the\u0000masked gaussian fields are trained by integrating two innovative losses: a\u0000multi-level perceptual masked loss focused on constructing building regions and\u0000a boundary loss aimed at enhancing the details of the boundaries between\u0000different masks. Finally, we improve the tetrahedral surface mesh extraction\u0000method based on the masked gaussian spheres. Comprehensive experiments on UAV\u0000images demonstrate that, compared to the traditional method and several\u0000NeRF-based and Gaussian-based SOTA solutions, our approach significantly\u0000improves both the accuracy and efficiency of building surface reconstruction.\u0000Notably, as a byproduct, there is an additional gain in the novel view\u0000synthesis of building.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"85 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimitris Angelis, Prodromos Kolyvakis, Manos Kamarianakis, George Papagiannakis
This paper introduces a novel integration of Large Language Models (LLMs) with Conformal Geometric Algebra (CGA) to revolutionize controllable 3D scene editing, particularly for object repositioning tasks, which traditionally require intricate manual processes and specialized expertise. Existing approaches typically either rely on large training datasets or lack a formalized language for precise edits. Utilizing CGA as a robust formal language, our system, shenlong, precisely models the spatial transformations necessary for accurate object repositioning. Leveraging the zero-shot learning capabilities of pre-trained LLMs, shenlong translates natural language instructions into CGA operations, which are then applied to the scene, facilitating exact spatial transformations within 3D scenes without the need for specialized pre-training. Implemented in a realistic simulation environment, shenlong ensures compatibility with existing graphics pipelines. To accurately assess the impact of CGA, we benchmark against robust Euclidean-space baselines, evaluating both latency and accuracy. Comparative performance evaluations indicate that shenlong significantly reduces LLM response times by 16% and boosts success rates by 9.6% on average compared to traditional methods. Notably, shenlong achieves a 100% success rate on common practical queries, a benchmark where other systems fall short. These advancements underscore shenlong's potential to democratize 3D scene editing, enhancing accessibility and fostering innovation across sectors such as education, digital entertainment, and virtual reality.
{"title":"Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes","authors":"Dimitris Angelis, Prodromos Kolyvakis, Manos Kamarianakis, George Papagiannakis","doi":"arxiv-2408.02275","DOIUrl":"https://doi.org/arxiv-2408.02275","url":null,"abstract":"This paper introduces a novel integration of Large Language Models (LLMs)\u0000with Conformal Geometric Algebra (CGA) to revolutionize controllable 3D scene\u0000editing, particularly for object repositioning tasks, which traditionally\u0000requires intricate manual processes and specialized expertise. These\u0000conventional methods typically suffer from reliance on large training datasets\u0000or lack a formalized language for precise edits. Utilizing CGA as a robust\u0000formal language, our system, shenlong, precisely models spatial transformations\u0000necessary for accurate object repositioning. Leveraging the zero-shot learning\u0000capabilities of pre-trained LLMs, shenlong translates natural language\u0000instructions into CGA operations which are then applied to the scene,\u0000facilitating exact spatial transformations within 3D scenes without the need\u0000for specialized pre-training. Implemented in a realistic simulation\u0000environment, shenlong ensures compatibility with existing graphics pipelines.\u0000To accurately assess the impact of CGA, we benchmark against robust Euclidean\u0000Space baselines, evaluating both latency and accuracy. Comparative performance\u0000evaluations indicate that shenlong significantly reduces LLM response times by\u000016% and boosts success rates by 9.6% on average compared to the traditional\u0000methods. Notably, shenlong achieves a 100% perfect success rate in common\u0000practical queries, a benchmark where other systems fall short. These\u0000advancements underscore shenlong's potential to democratize 3D scene editing,\u0000enhancing accessibility and fostering innovation across sectors such as\u0000education, digital entertainment, and virtual reality.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141932608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva
Despite advances in text-to-3D generation methods, the generation of multi-object arrangements remains challenging. Current methods exhibit failures in generating physically plausible arrangements that respect the provided text description. We present SceneMotifCoder (SMC), an example-driven framework for generating 3D object arrangements through visual program learning. SMC leverages large language models (LLMs) and program synthesis to overcome these challenges by learning visual programs from example arrangements. These programs are generalized into compact, editable meta-programs. When combined with 3D object retrieval and geometry-aware optimization, they can be used to create object arrangements that vary in arrangement structure and contained objects. Our experiments show that SMC generates high-quality arrangements using meta-programs learned from only a few examples. Evaluation results demonstrate that object arrangements generated by SMC better conform to user-specified text descriptions and are more physically plausible when compared with state-of-the-art text-to-3D generation and layout methods.
{"title":"SceneMotifCoder: Example-driven Visual Program Learning for Generating 3D Object Arrangements","authors":"Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva","doi":"arxiv-2408.02211","DOIUrl":"https://doi.org/arxiv-2408.02211","url":null,"abstract":"Despite advances in text-to-3D generation methods, generation of multi-object\u0000arrangements remains challenging. Current methods exhibit failures in\u0000generating physically plausible arrangements that respect the provided text\u0000description. We present SceneMotifCoder (SMC), an example-driven framework for\u0000generating 3D object arrangements through visual program learning. SMC\u0000leverages large language models (LLMs) and program synthesis to overcome these\u0000challenges by learning visual programs from example arrangements. These\u0000programs are generalized into compact, editable meta-programs. When combined\u0000with 3D object retrieval and geometry-aware optimization, they can be used to\u0000create object arrangements varying in arrangement structure and contained\u0000objects. Our experiments show that SMC generates high-quality arrangements\u0000using meta-programs learned from few examples. Evaluation results demonstrates\u0000that object arrangements generated by SMC better conform to user-specified text\u0000descriptions and are more physically plausible when compared with\u0000state-of-the-art text-to-3D generation and layout methods.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}