Lightweight, Edge-Aware, and Temporally Consistent Supersampling for Mobile Real-Time Rendering
Sipeng Yang, Jiayu Ji, Junhao Zhuge, Jinzhe Zhao, Qiang Qiu, Chen Li, Yuzhong Yan, Kerong Wang, Lingqi Yan, Xiaogang Jin
Supersampling has proven highly effective in enhancing visual fidelity by reducing aliasing, increasing resolution, and generating interpolated frames. It has become a standard component of modern real-time rendering pipelines. However, on mobile platforms, deep learning-based supersampling methods remain impractical due to stringent hardware constraints, while non-neural supersampling techniques often fall short in delivering perceptually high-quality results. In particular, producing visually pleasing reconstructions and temporally coherent interpolations is still a significant challenge in mobile settings. In this work, we present a novel, lightweight supersampling framework tailored for mobile devices. Our approach substantially improves both image reconstruction quality and temporal consistency while maintaining real-time performance. For super-resolution, we propose an intra-pixel object coverage estimation method for reconstructing high-quality anti-aliased pixels in edge regions, a gradient-guided strategy for non-edge areas, and a temporal sample accumulation approach to improve overall image quality. For frame interpolation, we develop an efficient motion estimation module coupled with a lightweight fusion scheme that integrates both estimated optical flow and rendered motion vectors, enabling temporally coherent interpolation of object dynamics and lighting variations. Extensive experiments demonstrate that our method consistently outperforms existing baselines in both perceptual image quality and temporal smoothness, while maintaining real-time performance on mobile GPUs. A demo application and supplementary materials are available on the project page.
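Temporal sample accumulation, mentioned above, generally works by reprojecting an accumulated history buffer with motion vectors and blending it with the current frame. The NumPy sketch below illustrates only that generic pattern; the nearest-neighbor reprojection and fixed blend weight are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reproject(history, motion):
    """Warp the accumulated history buffer to the current frame using
    per-pixel motion vectors (nearest-neighbor lookup for simplicity)."""
    h, w = history.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - motion[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - motion[..., 1]).astype(int), 0, h - 1)
    return history[src_y, src_x]

def accumulate(current, history, motion, alpha=0.1):
    """Exponential moving average of the reprojected history and the new frame."""
    warped = reproject(history, motion)
    return alpha * current + (1.0 - alpha) * warped

# Toy usage: a 4x4 grayscale frame pair with a uniform 1-pixel horizontal motion.
history = np.zeros((4, 4))
current = np.ones((4, 4))
motion = np.zeros((4, 4, 2))
motion[..., 0] = 1.0   # every pixel moved one texel to the right
print(accumulate(current, history, motion))
```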
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763348

CFC: Simulating Character-Fluid Coupling using a Two-Level World Model
Zhiyang Dou, Chen Peng, Xinyu Lu, Xiaohan Ye, Lixing Fang, Yuan Liu, Wenping Wang, Chuang Gan, Lingjie Liu, Taku Komura
Humans possess the ability to master a wide range of motor skills, enabling them to quickly and flexibly adapt to the surrounding environment. Despite recent progress in replicating such versatile human motor skills, existing research often oversimplifies or inadequately captures the complex interplay between human body movements and highly dynamic environments, such as interactions with fluids. In this paper, we present Character-Fluid Coupling (CFC), a world-model approach for simulating two-way coupled human-fluid interactions. We introduce a two-level world model that consists of a Physics-Informed Neural Network (PINN)-based model for fluid dynamics and a character world model capturing body dynamics under various external forces. This two-level world model adeptly predicts the dynamics of the fluid and its influence on rigid bodies via force prediction, sidestepping the computational burden of fluid simulation and providing policy gradients for efficient policy training. Once trained, our system can control characters to complete high-level tasks while adaptively responding to environmental changes. We also show that the fluid induces emergent behaviors in the characters, enhancing motion diversity and interactivity. Extensive experiments underscore the effectiveness of CFC, demonstrating its ability to produce high-quality, realistic human-fluid interaction animations.
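To make the idea of policy gradients through a learned world model concrete, the following PyTorch sketch backpropagates a task loss through a differentiable rollout of a stand-in dynamics network into a policy. All network sizes, the toy objective, and the rollout length are hypothetical placeholders; this is not CFC's two-level model.

```python
import torch
import torch.nn as nn

state_dim, act_dim = 8, 3

# Stand-ins for a learned dynamics model and a control policy (hypothetical sizes).
world_model = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.Tanh(),
                            nn.Linear(64, state_dim))      # predicts the next state
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, act_dim))

for p in world_model.parameters():      # the world model is pre-trained and kept fixed
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
target = torch.zeros(state_dim)         # toy task: drive the state toward the origin

for step in range(200):
    state = torch.randn(state_dim)
    loss = torch.zeros(())
    for t in range(10):                 # differentiable rollout through the world model
        action = policy(state)
        state = world_model(torch.cat([state, action]))
        loss = loss + ((state - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()                     # policy gradients flow through the world model
    opt.step()
```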
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763318

PractiLight: Practical Light Control Using Foundational Diffusion Models
Yotam Erel, Rishabh Dabral, Vladislav Golyanik, Amit H. Bermano, Christian Theobalt
Light control in generated images is a difficult task, posing challenges that span the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight takes a practical approach, effectively leveraging the foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interactions in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct-irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. We demonstrate state-of-the-art quality and control, with greater parameter and data efficiency than leading works, over a wide variety of scene types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.
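Classifier Guidance, as used above, steers sampling by nudging the noisy latent along the gradient of a differentiable objective. The sketch below shows that generic update with a stand-in irradiance regressor applied to the latent itself; the regressor architecture, guidance scale, and latent shape are assumptions for illustration and do not reproduce PractiLight's LoRA regressor on self-attention features.

```python
import torch
import torch.nn as nn

# Stand-in for a trained irradiance regressor operating on the noisy latent
# (a simplification; the paper's regressor works on self-attention features).
regressor = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 3, padding=1))

def guidance_step(latent, target_irradiance, scale=2.0):
    """One classifier-guidance nudge: move the latent so the predicted
    direct-irradiance map gets closer to the desired lighting."""
    latent = latent.detach().requires_grad_(True)
    pred = regressor(latent)
    loss = ((pred - target_irradiance) ** 2).mean()
    grad, = torch.autograd.grad(loss, latent)
    return latent.detach() - scale * grad   # descend the lighting loss

latent = torch.randn(1, 4, 32, 32)           # toy latent resolution
target = torch.zeros(1, 1, 32, 32)           # desired irradiance map (placeholder)
latent = guidance_step(latent, target)
```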
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763342

One-shot Embroidery Customization via Contrastive LoRA Modulation
Jun Ma, Qian He, Gaofeng He, Huang Chen, Chen Liu, Xiaogang Jin, Huamin Wang
Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by an intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To enable customization of such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features from a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline that handles image or text inputs using only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer. Our project is available at: https://style3d.github.io/embroidery_customization.
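The style-content disentanglement above relies on contrastive learning; a standard InfoNCE-style loss of the kind typically used for such objectives is sketched below. The temperature, feature dimensions, and batch construction are assumptions, and this is not the paper's exact two-stage objective.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic contrastive loss: pull `anchor` toward `positive`,
    push it away from every row of `negatives`."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg = anchor @ negatives.t() / temperature                      # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)          # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage: 128-d "style" features from the reference pair as anchor/positive,
# unrelated "content" features as negatives.
b, d = 4, 128
style_a = torch.randn(b, d, requires_grad=True)
style_b = torch.randn(b, d)
content = torch.randn(16, d)
loss = info_nce(style_a, style_b, content)
loss.backward()
```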
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763290

CrossGen: Learning and Generating Cross Fields for Quad Meshing
Qiujie Dong, Jiepeng Wang, Rui Xu, Cheng Lin, Yuan Liu, Shiqing Xin, Zichun Zhong, Xin Li, Changhe Tu, Taku Komura, Leif Kobbelt, Scott Schaefer, Wenping Wang
Cross fields play a critical role in various geometry processing tasks, especially for quad mesh generation. Existing methods for cross field generation often struggle to balance computational efficiency with generation quality, relying on slow per-shape optimization. We introduce CrossGen, a novel framework that supports both feed-forward prediction and latent generative modeling of cross fields for quad meshing by unifying geometry and cross field representations within a joint latent space. Our method enables extremely fast computation of high-quality cross fields for general input shapes, typically within one second and without per-shape optimization. Our method assumes a point-sampled surface, also called a point-cloud surface, as input, so we can accommodate various surface representations through a straightforward point sampling process. Using an auto-encoder network architecture, we encode input point-cloud surfaces into a sparse voxel grid with fine-grained latent spaces, which are decoded into both SDF-based surface geometry and cross fields. We also contribute a dataset of models with both high-quality signed distance field (SDF) representations and their corresponding cross fields, and use it to train our network. Once trained, the network computes a cross field for an input surface in a feed-forward manner, ensuring high geometric fidelity, noise resilience, and rapid inference. Furthermore, leveraging the same unified latent representation, we incorporate a diffusion model for computing cross fields of new shapes generated from partial input, such as sketches. To demonstrate its practical applications, we validate CrossGen on the quad mesh generation task for a large variety of surface shapes. Experimental results demonstrate that CrossGen generalizes well across diverse shapes and consistently yields high-fidelity cross fields, thus facilitating the generation of high-quality quad meshes.
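For context, a cross field assigns each surface point four tangent directions related by 90° rotations, so it is commonly stored as an angle with 4-fold rotational symmetry. The sketch below shows the standard (cos 4θ, sin 4θ) encoding that makes such fields convenient to regress and average; it illustrates only the representation, not CrossGen's decoder.

```python
import numpy as np

def encode_cross(theta):
    """Map a cross (defined only up to 90-degree rotations) to a unique 2-vector."""
    return np.stack([np.cos(4.0 * theta), np.sin(4.0 * theta)], axis=-1)

def decode_cross(v):
    """Recover a representative angle in [0, pi/2) from the 4-RoSy encoding."""
    return np.arctan2(v[..., 1], v[..., 0]) / 4.0 % (np.pi / 2.0)

# Angles that differ by 90 degrees describe the same cross and map to the same code.
theta = np.array([0.2, 0.2 + np.pi / 2.0])
codes = encode_cross(theta)
print(np.allclose(codes[0], codes[1]))   # True
print(decode_cross(codes))               # both decode to ~0.2
```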
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763299

Force-Dual Modes: Subspace Design from Stochastic Forces
Otman Benchekroun, Eitan Grinspun, Maurizio Chiaramonte, Philip Allen Etter
Designing subspaces for Reduced Order Modeling (ROM) is crucial for accelerating finite element simulations in graphics and engineering. Unfortunately, it is not always clear which subspace is optimal for arbitrary dynamic simulation. We propose to construct simulation subspaces from force distributions, allowing us to tailor such subspaces to common scene interactions involving constraint penalties, handle-based control, contact, and musculoskeletal actuation. To achieve this, we adopt a statistical perspective on Reduced Order Modeling, which allows us to push such user-designed force distributions through a linearized simulation to obtain a dual distribution on displacements. To construct our subspace, we then fit a low-rank Gaussian model to this displacement distribution, which we show generalizes Linear Modal Analysis subspaces for uncorrelated unit-variance force distributions, as well as Green's Function subspaces for low-rank force distributions. We show that our framework allows for the construction of subspaces that are optimal with respect to both physical material properties and arbitrary force distributions, as observed in handle-based, contact, and musculoskeletal scene interactions.
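For a zero-mean Gaussian force distribution f ~ N(0, Σ_f) and a linearized solve u = K⁻¹f, the dual displacement distribution has covariance K⁻¹ Σ_f K⁻ᵀ, and a low-rank Gaussian fit amounts to keeping its leading eigenvectors. The NumPy sketch below illustrates this statistical construction on a toy 1D spring chain; the stiffness matrix and force covariance are made up, and the paper's actual pipeline is not reproduced.

```python
import numpy as np

n = 20                                                     # nodes of a toy 1D spring chain
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)     # fixed-ends Laplacian stiffness

# User-designed force distribution: forces concentrated around a "handle" at node 5.
sigma_f = np.diag(np.exp(-0.5 * ((np.arange(n) - 5) / 2.0) ** 2))

# Dual displacement covariance under the linearized model u = K^{-1} f.
K_inv = np.linalg.inv(K)
sigma_u = K_inv @ sigma_f @ K_inv.T

# Low-rank Gaussian fit: keep the leading eigenvectors as the simulation subspace.
eigvals, eigvecs = np.linalg.eigh(sigma_u)
order = np.argsort(eigvals)[::-1]
basis = eigvecs[:, order[:4]]                              # 4 force-aware subspace modes
print(basis.shape)                                         # (20, 4)
```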
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763310

SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes
Sanjeev Muralikrishnan, Niladri Shekhar Dutt, Niloy J. Mitra
Animation retargeting applies a sparse motion description (e.g., keypoint sequences) to a character mesh to produce a semantically plausible and temporally coherent full-body mesh sequence. Existing approaches come with restrictions: they require access to template-based shape priors or artist-designed deformation rigs, suffer from limited generalization to unseen motion and/or shapes, or exhibit motion jitter. We propose Self-supervised Motion Fields (SMF), a self-supervised framework that is trained with only sparse motion representations, without requiring dataset-specific annotations, templates, or rigs. At the heart of our method are Kinetic Codes, a novel autoencoder-based sparse motion encoding that exposes a semantically rich latent space, simplifying large-scale training. Our architecture comprises dedicated spatial and temporal gradient predictors, which are jointly trained in an end-to-end fashion. The combined network, regularized by the Kinetic Codes' latent space, generalizes well across both unseen shapes and new motions. We evaluate our method on unseen motion sampled from AMASS, D4D, Mixamo, and raw monocular video, transferring animation to various characters with varying shapes and topology. We report a new state of the art (SoTA) on the AMASS dataset for generalization to unseen motion.
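The Kinetic Codes component is, at its core, an autoencoder over sparse motion inputs. The sketch below shows a minimal stand-in: a plain MLP autoencoder on flattened keypoint windows. Sizes, window length, and data are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

num_keypoints, window, latent_dim = 24, 8, 32
in_dim = num_keypoints * 3 * window      # a short window of 3D keypoints, flattened

encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):
    x = torch.randn(16, in_dim)          # stand-in for real sparse motion clips
    code = encoder(x)                    # "kinetic code" latent
    recon = decoder(code)
    loss = ((recon - x) ** 2).mean()     # self-supervised reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```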
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763309

MALeR: Improving Compositional Fidelity in Layout-Guided Generation
Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi
Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. Yet these methods still struggle, particularly in compositional scenes: unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
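One common way to keep a subject inside its layout is to mask cross-attention so that a subject's text tokens only influence pixels inside its box. The sketch below isolates that masking idea; the tensor shapes and mask construction are illustrative assumptions and do not reproduce MALeR's masked, attribute-aware binding mechanism.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, token_region):
    """q: (P, d) image queries, k/v: (T, d) text tokens,
    token_region: (T, P) boolean mask, True where a token may influence a pixel."""
    scores = q @ k.t() / (q.shape[-1] ** 0.5)              # (P, T)
    scores = scores.masked_fill(~token_region.t(), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy setup: an 8x8 latent (64 pixels), 4 text tokens, token 1 restricted to the left half.
P, T, d = 64, 4, 16
q, k, v = torch.randn(P, d), torch.randn(T, d), torch.randn(T, d)
region = torch.ones(T, P, dtype=torch.bool)
left_half = (torch.arange(P) % 8) < 4                      # columns 0-3 of the 8x8 grid
region[1] = left_half                                      # subject token 1 stays in its box
out = masked_cross_attention(q, k, v, region)
print(out.shape)                                           # torch.Size([64, 16])
```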
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763341

Marching Neurons: Accurate Surface Extraction for Neural Implicit Shapes
Christian Stippel, Felix Mujkanovic, Thomas Leimkühler, Pedro Hermosilla
Accurate surface geometry representation is crucial in 3D visual computing. Explicit representations, such as polygonal meshes, and implicit representations, like signed distance functions, each have distinct advantages, making efficient conversions between them increasingly important. Conventional surface extraction methods for implicit representations, such as the widely used Marching Cubes algorithm, rely on spatial decomposition and sampling, leading to inaccuracies due to fixed and limited resolution. We introduce a novel approach for analytically extracting surfaces from neural implicit functions. Our method operates natively in parallel and can navigate large neural architectures. By leveraging the fact that each neuron partitions the domain, we develop a depth-first traversal strategy to efficiently track the encoded surface. The resulting meshes faithfully capture the full geometric information from the network without ad-hoc spatial discretization, achieving unprecedented accuracy across diverse shapes and network architectures while maintaining competitive speed.
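The observation that each neuron partitions the domain can be made concrete for a ReLU network: fixing the activation pattern of a point fixes an affine function that the network equals throughout that region, so the encoded surface is locally a plane. The sketch below extracts this local affine form for a tiny random MLP standing in for a neural SDF; the paper's traversal strategy is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# A tiny random ReLU MLP mapping R^3 -> R, standing in for a neural SDF.
W1, b1 = rng.normal(size=(16, 3)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def local_affine(x):
    """Within the activation region containing x, the network equals g.x + c."""
    pre = W1 @ x + b1
    mask = (pre > 0).astype(float)          # activation pattern: which neurons fire
    g = W2 @ (mask[:, None] * W1)           # gradient of the region's affine function
    c = W2 @ (mask * b1) + b2               # offset of the region's affine function
    return g.ravel(), c.item()

x = np.array([0.1, -0.2, 0.3])
g, c = local_affine(x)
# Inside this region the zero level set is the plane {p : g.p + c = 0}.
print(np.allclose(g @ x + c, W2 @ np.maximum(W1 @ x + b1, 0.0) + b2))   # True
```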
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763328

Jackknife Transmittance and MIS Weight Estimation
Christoph Peters
A core operation in Monte Carlo volume rendering is transmittance estimation: given a segment along a ray, the goal is to estimate the fraction of light that will pass through this segment without encountering absorption or out-scattering. A naive approach is to estimate the optical depth τ using unbiased ray marching and then use exp(-τ) as the transmittance estimate. However, this strategy systematically overestimates transmittance due to Jensen's inequality. Existing unbiased transmittance estimators, on the other hand, either suffer from high variance or have a cost governed by random decisions, which makes them less suitable for SIMD architectures. We propose a biased transmittance estimator with significantly reduced bias compared to the naive approach and a deterministic, low cost. We observe that ray marching with stratified jittered sampling yields estimates of optical depth that are nearly normally distributed. We then apply the unique minimum variance unbiased (UMVU) estimator of exp(-τ) based on two such estimates (using two different sets of random numbers). Bias arises only from violations of the assumption of normally distributed inputs. We further reduce bias and variance using a variance-aware importance sampling scheme. The underlying theory can be used to estimate any analytic function of the optical depth. We use this generalization to estimate multiple importance sampling (MIS) weights and introduce two integrators: unbiased MIS with biased MIS weights, and a more efficient but biased combination of MIS and transmittance estimation.
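The overestimation caused by Jensen's inequality is easy to reproduce: since exp is convex, E[exp(-τ̂)] ≥ exp(-E[τ̂]) = exp(-τ) for any unbiased estimate τ̂ of the optical depth. The sketch below demonstrates this for an unbiased stratified-jittered ray-marching estimator in a toy heterogeneous medium; the extinction function and sample counts are arbitrary, and the paper's jackknife/UMVU estimator is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)

def extinction(t):
    """Toy heterogeneous extinction coefficient on the unit segment [0, 1]."""
    return 2.0 + 1.5 * np.sin(6.0 * t)

def optical_depth_estimate(n_steps=8):
    """Unbiased stratified-jittered ray-marching estimate of tau over a unit segment."""
    jitter = rng.random(n_steps)
    t = (np.arange(n_steps) + jitter) / n_steps
    return extinction(t).mean()          # segment length is 1, so the mean estimates the integral

# Reference optical depth via a dense midpoint rule.
ts = (np.arange(200000) + 0.5) / 200000
tau = extinction(ts).mean()

naive = np.array([np.exp(-optical_depth_estimate()) for _ in range(100000)])
print("true transmittance      :", np.exp(-tau))
print("naive estimator average :", naive.mean())   # consistently larger (Jensen's inequality)
```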
ACM Transactions on Graphics, 2025. https://doi.org/10.1145/3763273