Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
Slava Elizarov, Ciara Rowles, Simon Donné
arXiv:2409.03718 (arXiv - CS - Graphics, 2024-09-05)

Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and the complexity of 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that uses geometry images to represent 3D shapes efficiently as 2D images, thereby avoiding the need for complex 3D-aware architectures. By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data (allowing us to use only high-quality training data) while retaining compatibility with guidance techniques such as IPAdapter. In short, GIMDiffusion generates 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.

Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes
Stefano Esposito, Anpei Chen, Christian Reiser, Samuel Rota Bulò, Lorenzo Porzi, Katja Schwarz, Christian Richardt, Michael Zollhöfer, Peter Kontschieder, Andreas Geiger
arXiv:2409.02482 (arXiv - CS - Graphics, 2024-09-04)

High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods are generally the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty-space skipping in volume rendering (P2) and by sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g., on mobile devices. We present a novel representation for real-time view synthesis in which (P1) the number of sampling locations is small and bounded, (P2) sampling locations are found efficiently via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.

A General Albedo Recovery Approach for Aerial Photogrammetric Images through Inverse Rendering
Shuang Song, Rongjun Qin
arXiv:2409.03032 (arXiv - CS - Graphics, 2024-09-04)

Modeling outdoor scenes for synthetic 3D environments requires recovering reflectance/albedo information from raw images, an ill-posed problem due to the complicated unmodeled physics involved (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in practical contexts. Recovered albedo facilitates model relighting and shading, which can further enhance the realism of rendered models and their use in digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently bakes the lighting at capture time into the texture as unwanted artifacts. These polluted textures are therefore suboptimal for realistic rendering in a synthetic environment. In addition, the embedded environmental lighting reduces photo-consistency across images, which in turn causes image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illumination and derives the inverse model to resolve the albedo through inverse-rendering-based intrinsic image decomposition. Our approach builds on the fact that both the sun illumination and the scene geometry are estimable in aerial photogrammetry, so they provide direct inputs for this ill-posed problem. This physics-based approach requires no input beyond the data acquired in a typical drone-based photogrammetric collection and is shown to outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image-processing tasks in photogrammetry, such as feature and dense matching and edge and line extraction.

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan
arXiv:2409.02095 (arXiv - CS - Graphics, 2024-09-03)

Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter generalizes to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through a meticulously designed three-stage training strategy on compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences of variable length, up to 110 frames, in a single pass, and to harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.

Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
Sohan Anisetty, James Hays
arXiv:2409.01591 (arXiv - CS - Graphics, 2024-09-03)

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.

AMG: Avatar Motion Guided Video Generation
Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang
arXiv:2409.01502 (arXiv - CS - Graphics, 2024-09-02)

The human video generation task has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is inherently challenging due to the intricacies of human body topology and sensitivity to visual artifacts. Extensively studied 2D media generation methods take advantage of massive human media datasets but struggle with 3D-aware control, whereas 3D avatar-based approaches, while offering more freedom of control, lack photorealism and cannot be harmonized seamlessly with the background scene. We propose AMG, a method that combines 2D photorealism with 3D controllability by conditioning video diffusion models on controlled renderings of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.

DiffCSG: Differentiable CSG via Rasterization
Haocheng Yuan, Adrien Bousseau, Hao Pan, Chengquan Zhang, Niloy J. Mitra, Changjian Li
arXiv:2409.01421 (arXiv - CS - Graphics, 2024-09-02)

Differentiable rendering is a key ingredient for inverse rendering and machine learning, as it allows optimizing scene parameters (shape, materials, lighting) to best fit target images. Differentiable rendering requires that each scene parameter relate to pixel values through differentiable operations. While 3D mesh rendering algorithms have been implemented in a differentiable way, these algorithms do not directly extend to Constructive Solid Geometry (CSG), a popular parametric representation of shapes, because the underlying boolean operations are typically performed with complex black-box mesh-processing libraries. We present an algorithm, DiffCSG, to render CSG models in a differentiable manner. Our algorithm builds upon CSG rasterization, which displays the result of boolean operations between primitives without explicitly computing the resulting mesh and, as such, bypasses black-box mesh processing. We describe how to implement CSG rasterization within a differentiable rendering pipeline, taking special care to apply antialiasing along primitive intersections to obtain gradients in such critical areas. Our algorithm is simple and fast, can be easily incorporated into modern machine learning setups, and enables a range of applications for computer-aided design, including direct and image-based editing of CSG primitives. Code and data: https://yyyyyhc.github.io/DiffCSG/.

Curvy: A Parametric Cross-section based Surface Reconstruction
Aradhya N. Mathur, Apoorv Khattar, Ojaswa Sharma
arXiv:2409.00829 (arXiv - CS - Graphics, 2024-09-01)

In this work, we present a novel approach for reconstructing shape point clouds from planar sparse cross-sections with the help of generative modeling. We highlight the unique challenges pertaining to representation and reconstruction in this problem setting. Most methods in the classical literature lack the ability to generalize based on object class and employ complex mathematical machinery to reconstruct reliable surfaces. We present a simple learnable approach that generates a large number of points from a small number of input cross-sections over a large dataset. We represent the cross-sections with a compact parametric polyline representation obtained through adaptive splitting, and we train a Graph Neural Network to reconstruct the underlying shape adaptively, reducing the dependence on the number of cross-sections provided.

GroomCap: High-Fidelity Prior-Free Hair Capture
Yuxiao Zhou, Menglei Chai, Daoye Wang, Sebastian Winberg, Erroll Wood, Kripasindhu Sarkar, Markus Gross, Thabo Beeler
arXiv:2409.00831 (arXiv - CS - Graphics, 2024-09-01)

Despite recent advances in multi-view hair reconstruction, achieving strand-level precision remains a significant challenge due to inherent limitations in existing capture pipelines. We introduce GroomCap, a novel multi-view hair capture method that reconstructs faithful and high-fidelity hair geometry without relying on external data priors. To address the limitations of conventional reconstruction algorithms, we propose a neural implicit representation for hair volume that encodes high-resolution 3D orientation and occupancy from input views. This implicit hair volume is trained with a new volumetric 3D orientation rendering algorithm, coupled with 2D orientation distribution supervision, to effectively prevent the loss of structural information caused by undesired orientation blending. We further propose a Gaussian-based hair optimization strategy to refine the traced hair strands with a novel chained Gaussian representation, utilizing direct photometric supervision from images. Our results demonstrate that GroomCap is able to capture high-quality hair geometries that are not only more precise and detailed than existing methods but also versatile enough for a range of applications.

Mastoidectomy Multi-View Synthesis from a Single Microscopy Image
Yike Zhang, Jack Noble
arXiv:2409.03190 (arXiv - CS - Graphics, 2024-08-31)

Cochlear Implant (CI) procedures involve performing an invasive mastoidectomy to insert an electrode array into the cochlea. In this paper, we introduce a novel pipeline that is capable of generating synthetic multi-view videos from a single CI microscope image. In our approach, we use a patient's pre-operative CT scan to predict the post-mastoidectomy surface using a method designed for this purpose. We manually align the surface with a selected microscope frame to obtain an accurate initial pose of the reconstructed CT mesh relative to the microscope. We then perform UV projection to transfer the colors from the frame to the surface textures. Novel views of the textured surface can be used to generate a large dataset of synthetic frames with ground-truth poses. We evaluated the quality of synthetic views rendered using PyTorch3D and PyVista, and found that both rendering engines produce similarly high-quality synthetic novel-view frames, with a structural similarity index (SSIM) relative to ground truth averaging about 0.86 for both. A large dataset of novel views with known poses is critical for ongoing training of a method that automatically estimates microscope pose for 2D-to-3D registration with the pre-operative CT to facilitate augmented-reality surgery. This dataset will empower various downstream tasks, such as integrating Augmented Reality (AR) in the operating room, tracking surgical tools, and supporting other video analysis studies.