Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, Elisa Ricci
Neural video game simulators have emerged as powerful tools for generating and editing videos. The underlying idea is to represent a game as the evolution of an environment’s state, driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a set of natural language actions and desired states. The result, a Promptable Game Model (PGM), makes it possible for a user to play the game by prompting it with high- and low-level action sequences. Most notably, our PGM unlocks the director’s mode, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning “game AI”, encapsulated by our animation model, to navigate the scene under high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated, and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art. Our framework, data, and models are available at snap-research.github.io/promptable-game-models.
{"title":"Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models","authors":"Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, Elisa Ricci","doi":"10.1145/3635705","DOIUrl":"https://doi.org/10.1145/3635705","url":null,"abstract":"<p>Neural video game simulators emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment’s state driven by the actions of its agents. While such a paradigm enables users to <i>play</i> a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with <i>prompts</i> specified as a set of <i>natural language</i> actions and <i>desired states</i>. The result—a Promptable Game Model (PGM)—makes it possible for a user to <i>play</i> the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the <i>director’s mode</i>, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning “game AI”, encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art. Our framework, data, and models are available at snap-research.github.io/promptable-game-models.</p>","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"11 1","pages":""},"PeriodicalIF":6.2,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138544775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Ruihui Li, Chi-Wing Fu
This paper presents a new approach for 3D shape generation, inversion, and manipulation through direct generative modeling of a continuous implicit representation in the wavelet domain. Specifically, we propose a compact wavelet representation with a pair of coarse and detail coefficient volumes to implicitly represent 3D shapes via truncated signed distance functions and multi-scale biorthogonal wavelets. Then, we design a pair of neural networks: a diffusion-based generator to produce diverse shapes in the form of coarse coefficient volumes, and a detail predictor to produce compatible detail coefficient volumes that introduce fine structures and details. Further, we can jointly train an encoder network to learn a latent space for inverting shapes, enabling a rich variety of whole-shape and region-aware shape manipulations. Both quantitative and qualitative experimental results demonstrate the compelling shape generation, inversion, and manipulation capabilities of our approach over state-of-the-art methods.
{"title":"Neural Wavelet-domain Diffusion for 3D Shape Generation, Inversion, and Manipulation","authors":"Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Ruihui Li, Chi-Wing Fu","doi":"10.1145/3635304","DOIUrl":"https://doi.org/10.1145/3635304","url":null,"abstract":"<p>This paper presents a new approach for 3D shape generation, inversion, and manipulation, through a direct generative modeling on a continuous implicit representation in wavelet domain. Specifically, we propose a <i>compact wavelet representation</i> with a pair of coarse and detail coefficient volumes to implicitly represent 3D shapes via truncated signed distance functions and multi-scale biorthogonal wavelets. Then, we design a pair of neural networks: a diffusion-based <i>generator</i> to produce diverse shapes in the form of the coarse coefficient volumes and a <i>detail predictor</i> to produce compatible detail coefficient volumes for introducing fine structures and details. Further, we may jointly train an <i>encoder network</i> to learn a latent space for inverting shapes, allowing us to enable a rich variety of whole-shape and region-aware shape manipulations. Both quantitative and qualitative experimental results manifest the compelling shape generation, inversion, and manipulation capabilities of our approach over the state-of-the-art methods.</p>","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":" 14","pages":""},"PeriodicalIF":6.2,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138473490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jia-Mu Sun, Jie Yang, Kaichun Mo, Yu-Kun Lai, Leonidas Guibas, Lin Gao
3D scene synthesis facilitates and benefits many real-world applications. Most scene generators focus on making indoor scenes plausible by learning from training data and leveraging extra constraints such as adjacency and symmetry. Although the generated 3D scenes are mostly plausible, with visually realistic layouts, they can be functionally unsuitable for human users to navigate and interact with furniture. Our key observation is that human activity plays a critical role and that sufficient free space is essential for human-scene interactions. This is exactly where many existing synthesized scenes fail: the seemingly correct layouts are often not fit for living. To tackle this, we present Haisor, a human-aware optimization framework for 3D indoor scene arrangement via reinforcement learning, which aims to find an action sequence that optimizes the indoor scene layout automatically. Based on a hierarchical scene graph representation, an optimal action sequence is predicted and performed via Deep Q-Learning with Monte Carlo Tree Search (MCTS), where MCTS is the key component for searching for optimal solutions over long-horizon sequences and a large action space. Multiple human-aware rewards are designed as our core criteria of human-scene interaction, guiding the agent to identify the next smart action. Our framework is optimized end-to-end given indoor scenes with part-level furniture layouts, including part mobility information. Furthermore, our methodology is extensible and allows different reward designs to achieve personalized indoor scene synthesis. Extensive experiments demonstrate that our approach optimizes the layout of 3D indoor scenes in a human-aware manner, producing layouts that are more realistic and plausible than the original state-of-the-art generator results, and that it produces superior smart actions, outperforming alternative baselines.
{"title":"Haisor: Human-Aware Indoor Scene Optimization via Deep Reinforcement Learning","authors":"Jia-Mu Sun, Jie Yang, Kaichun Mo, Yu-Kun Lai, Leonidas Guibas, Lin Gao","doi":"10.1145/3632947","DOIUrl":"https://doi.org/10.1145/3632947","url":null,"abstract":"<p>3D scene synthesis facilitates and benefits many real-world applications. Most scene generators focus on making indoor scenes plausible via learning from training data and leveraging extra constraints such as adjacency and symmetry. Although the generated 3D scenes are mostly plausible with visually realistic layouts, they can be functionally unsuitable for human users to navigate and interact with furniture. Our key observation is that human activity plays a critical role and sufficient free space is essential for human-scene interactions. This is exactly where many existing synthesized scenes fail – the seemingly correct layouts are often not fit for living. To tackle this, we present a human-aware optimization framework <span>Haisor</span> for 3D indoor scene arrangement via reinforcement learning, which aims to find an action sequence to optimize the indoor scene layout automatically. Based on the hierarchical scene graph representation, an optimal action sequence is predicted and performed via Deep Q-Learning with Monte Carlo Tree Search (MCTS), where MCTS is our key feature to search for the optimal solution in long-term sequences and large action space. Multiple human-aware rewards are designed as our core criteria of human-scene interaction, aiming to identify the next smart action by leveraging powerful reinforcement learning. Our framework is optimized end-to-end by giving the indoor scenes with part-level furniture layout including part mobility information. Furthermore, our methodology is extensible and allows utilizing different reward designs to achieve personalized indoor scene synthesis. Extensive experiments demonstrate that our approach optimizes the layout of 3D indoor scenes in a human-aware manner, which is more realistic and plausible than original state-of-the-art generator results, and our approach produces superior smart actions, outperforming alternative baselines.</p>","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"86 19","pages":""},"PeriodicalIF":6.2,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138438944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vladimir Garanzha, Igor Kaporin, Liudmila Kudryavtseva, Francois Protais, Dmitry Sokolov
Optimal mapping is one of the longest-standing problems in computational mathematics. It is natural to assess the quality of a map by measuring the relative curve-length error it induces. The maximum of this error is called the quasi-isometry constant, and its minimization is a nontrivial max-norm optimization problem. We present a physics-based quasi-isometric stiffening (QIS) algorithm for the max-norm minimization of hyperelastic distortion.
QIS perfectly equidistributes distortion over the entire domain for the ground-truth test (unit hemisphere flattening) and, when perfect equidistribution is not possible, tends to create zones where all cells have the same distortion. Such zones correspond to fragments of elastic material that become rigid under stiffening, reaching the deformation limit. As such, maps built by QIS are related to the de Boor equidistribution principle, which asks for the integral of a certain error indicator function to be the same over each mesh cell.
Under certain assumptions on the minimization toolbox, we prove that our method can build, in a finite number of steps, a deformation whose maximum distortion is arbitrarily close to the (unknown) minimum. We performed extensive testing: on more than 10,000 domains QIS was reliably better than the competing methods. In summary, we reliably build 2D and 3D mesh deformations with the smallest known distortion estimates for very stiff problems.
{"title":"In the Quest for Scale-Optimal Mappings","authors":"Vladimir Garanzha, Igor Kaporin, Liudmila Kudryavtseva, Francois Protais, Dmitry Sokolov","doi":"10.1145/3627102","DOIUrl":"https://doi.org/10.1145/3627102","url":null,"abstract":"<p>Optimal mapping is one of the longest-standing problems in computational mathematics. It is natural to measure the relative curve length error under map to assess its quality. The maximum of such error is called the quasi-isometry constant, and its minimization is a nontrivial max-norm optimization problem. We present a physics-based quasi-isometric stiffening (QIS) algorithm for the max-norm minimization of hyperelastic distortion. </p><p>QIS perfectly equidistributes distortion over the entire domain for the ground truth test (unit hemisphere flattening) and, when it is not possible, tends to create zones where all cells have the same distortion. Such zones correspond to fragments of elastic material that became rigid under stiffening, reaching the deformation limit. As such, maps built by QIS are related to the de Boor equidistribution principle, which asks for an integral of a certain error indicator function to be the same over each mesh cell. </p><p>Under certain assumptions on the minimization toolbox, we prove that our method can build, in a finite number of steps, a deformation whose maximum distortion is arbitrarily close to the (unknown) minimum. We performed extensive testing: on more than 10,000 domains QIS was reliably better than the competing methods. In summary, we reliably build 2D and 3D mesh deformations with the smallest known distortion estimates for very stiff problems.</p>","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"86 23","pages":""},"PeriodicalIF":6.2,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138438943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jing Ren, Aviv Segall, Olga Sorkine-Hornung
We develop an optimization-based method to model smocking, a surface embroidery technique that provides decorative geometric texturing while maintaining the stretch properties of the fabric. During smocking, multiple pairs of points on the fabric are stitched together, creating non-manifold geometric features and visually pleasing textures. Designing smocking patterns is challenging because the outcome of stitching is unpredictable: the final texture is often revealed only when the whole smocking process is completed, necessitating painstaking physical fabrication and time-consuming trial-and-error experimentation. This motivates us to seek a digital smocking design method. Straightforward attempts to compute smocked fabric geometry using surface deformation or cloth simulation methods fail to produce realistic results, likely due to the intricate structure of the designs, the large number of contacts, and the high-curvature folds. We instead formulate smocking as a graph embedding and shape deformation problem. We extract a coarse graph representing the fabric and the stitching constraints, and then derive the graph structure of the smocked result. We solve for the 3D embedding of this graph, which in turn reliably guides the deformation of the high-resolution fabric mesh. Our optimization-based method is simple, efficient, and flexible, which allows us to build an interactive system for smocking pattern exploration. To demonstrate the accuracy of our method, we compare our results to real fabrications on a large set of smocking patterns.
{"title":"Digital 3D Smocking Design","authors":"Jing Ren, Aviv Segall, Olga Sorkine-Hornung","doi":"10.1145/3631945","DOIUrl":"https://doi.org/10.1145/3631945","url":null,"abstract":"<p>We develop an optimization-based method to model <i>smocking</i>, a surface embroidery technique that provides decorative geometric texturing while maintaining stretch properties of the fabric. During smocking, multiple pairs of points on the fabric are stitched together, creating non-manifold geometric features and visually pleasing textures. Designing smocking patterns is challenging, because the outcome of stitching is unpredictable: the final texture is often revealed only when the whole smocking process is completed, necessitating painstaking physical fabrication and time consuming trial-and-error experimentation. This motivates us to seek a digital smocking design method. Straightforward attempts to compute smocked fabric geometry using surface deformation or cloth simulation methods fail to produce realistic results, likely due to the intricate structure of the designs, the large number of contacts and high-curvature folds. We instead formulate smocking as a graph embedding and shape deformation problem. We extract a coarse graph representing the fabric and the stitching constraints, and then derive the graph structure of the smocked result. We solve for the 3D embedding of this graph, which in turn reliably guides the deformation of the high-resolution fabric mesh. Our optimization based method is simple, efficient, and flexible, which allows us to build an interactive system for smocking pattern exploration. To demonstrate the accuracy of our method, we compare our results to real fabrications on a large set of smocking patterns.</p>","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"87 1","pages":""},"PeriodicalIF":6.2,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138438942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefan Rhys Jeske, Lukas Westhofen, Fabian Löschner, José Antonio Fernández-Fernández, Jan Bender
The numerical simulation of surface tension is an active area of research in many fields of application and has been attempted using a wide range of methods. Our contribution is the derivation and implementation of an implicit, cohesion-force-based approach for simulating surface tension effects with the Smoothed Particle Hydrodynamics (SPH) method. We define a continuous formulation inspired by the properties of surface tension at the molecular scale, which we spatially discretize using SPH. An adapted variant of the linearized backward Euler method is used for time discretization, which we also strongly couple with an implicit viscosity model. Finally, we extend our formulation with adhesion forces for interfaces with rigid objects. Existing SPH approaches for surface tension in computer graphics are mostly based on explicit time integration and therefore lack stability in challenging settings. We compare our implicit surface tension method to these approaches and further evaluate our model on a wide variety of complex scenarios, showcasing its efficacy and versatility. These include, among others, simulations of a water crown, a dripping faucet, and a droplet toy.
{"title":"Implicit Surface Tension for SPH Fluid Simulation","authors":"Stefan Rhys Jeske, Lukas Westhofen, Fabian Löschner, José Antonio Fernández-Fernández, Jan Bender","doi":"10.1145/3631936","DOIUrl":"https://doi.org/10.1145/3631936","url":null,"abstract":"The numerical simulation of surface tension is an active area of research in many different fields of application and has been attempted using a wide range of methods. Our contribution is the derivation and implementation of an implicit cohesion force based approach for the simulation of surface tension effects using the Smoothed Particle Hydrodynamics (SPH) method. We define a continuous formulation inspired by the properties of surface tension at the molecular scale which is spatially discretized using SPH. An adapted variant of the linearized backward Euler method is used for time discretization, which we also strongly couple with an implicit viscosity model. Finally, we extend our formulation with adhesion forces for interfaces with rigid objects. Existing SPH approaches for surface tension in computer graphics are mostly based on explicit time integration, thereby lacking in stability for challenging settings. We compare our implicit surface tension method to these approaches and further evaluate our model on a wider variety of complex scenarios, showcasing its efficacy and versatility. Among others, these include but are not limited to simulations of a water crown, a dripping faucet and a droplet-toy.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"277 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135474947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jae Joong Lee, Bosheng Li, Bedrich Benes
We show how a Transformer can encode hierarchical tree-like string structures by introducing a new deep learning-based framework for generating 3D biological tree models represented as Lindenmayer system (L-system) strings. L-systems are string-rewriting procedural systems that encode tree topology and geometry. L-systems are efficient, but creating the production rules is one of the most critical problems precluding their use in practice. We substitute procedural rule creation with a deep neural model: instead of writing the rules, we train a deep neural model that produces the output strings. We train our model on 155k tree geometries that are encoded as L-strings, de-parameterized, and converted to a hierarchy of linear sequences corresponding to branches. An end-to-end deep learning model with an attention mechanism then learns the distributions of geometric operations and branches from the input, effectively replacing L-system rewriting-rule generation. Given a starting string, the trained deep model generates new L-strings representing 3D tree models in the same way an L-system would. Our model allows for the generation of a wide variety of new trees, and the deep model agrees with the input by 93.7% in branching angles, 97.2% in branch lengths, and 92.3% in an extracted list of geometric features. We also validate the generated trees using perceptual metrics, showing 97% agreement with the input geometric models.
{"title":"Latent L-systems: Transformer-based Tree Generator","authors":"Jae Joong Lee, Bosheng Li, Bedrich Benes","doi":"10.1145/3627101","DOIUrl":"https://doi.org/10.1145/3627101","url":null,"abstract":"We show how a Transformer can encode hierarchical tree-like string structures by introducing a new deep learning-based framework for generating 3D biological tree models represented as Lindenmayer system (L-system) strings. L-systems are string-rewriting procedural systems that encode tree topology and geometry. L-systems are efficient, but creating the production rules is one of the most critical problems precluding their usage in practice. We substitute the procedural rules creation with a deep neural model. Instead of writing the rules, we train a deep neural model that produces the output strings. We train our model on 155k tree geometries that are encoded as L-strings, de-parameterized, and converted to a hierarchy of linear sequences corresponding to branches. An end-to-end deep learning model with an attention mechanism then learns the distributions of geometric operations and branches from the input, effectively replacing the L-system rewriting rule generation. The trained deep model generates new L-strings representing 3D tree models in the same way L-systems do by providing the starting string. Our model allows for the generation of a wide variety of new trees, and the deep model agrees with the input by 93.7% in branching angles, 97.2% in branch lengths, and 92.3% in an extracted list of geometric features. We also validate the generated trees using perceptual metrics showing 97% agreement with input geometric models.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"60 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135875314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pu Li, Weize Quan, Jianwei Guo, Dong-Ming Yan
Single-image rectification of document deformation is a challenging task. Although some recent deep learning-based methods have attempted to solve this problem, they cannot achieve satisfactory results when dealing with document images with complex deformations. In this article, we propose a new, efficient framework for document flattening. Our main insight is that most layout primitives in a document have rectangular outlines, so unwarping local layout primitives is essentially homogeneous with unwarping the entire document. The former task is clearly more straightforward to solve than the latter due to the more consistent texture and relatively smooth deformation. On this basis, we propose a layout-aware deep model that works in a divide-and-conquer manner. First, we employ a transformer-based segmentation module to obtain the layout information of the input document. Then a new regression module is applied to predict the global and local UV maps. Finally, we design an effective merging algorithm to correct the global prediction with local details. Both quantitative and qualitative experimental results demonstrate that our framework achieves favorable performance against state-of-the-art methods. In addition, the currently available public document-flattening datasets offer limited 3D paper shapes, lack layout annotations, and lack a general geometric correction metric. Therefore, we build a new large-scale synthetic dataset by utilizing a fully automatic rendering method to generate deformed documents with diverse shapes and exact layout segmentation labels. We also propose a new geometric correction metric based on our paired document UV maps. Code and dataset will be released at https://github.com/BunnySoCrazy/LA-DocFlatten.
{"title":"Layout-Aware Single-Image Document Flattening","authors":"Pu Li, Weize Quan, Jianwei Guo, Dong-Ming Yan","doi":"10.1145/3627818","DOIUrl":"https://doi.org/10.1145/3627818","url":null,"abstract":"Single image rectification of document deformation is a challenging task. Although some recent deep learning-based methods have attempted to solve this problem, they cannot achieve satisfactory results when dealing with document images with complex deformations. In this article, we propose a new efficient framework for document flattening. Our main insight is that most layout primitives in a document have rectangular outline shapes, making unwarping local layout primitives essentially homogeneous with unwarping the entire document. The former task is clearly more straightforward to solve than the latter due to the more consistent texture and relatively smooth deformation. On this basis, we propose a layout-aware deep model working in a divide-and-conquer manner. First, we employ a transformer-based segmentation module to obtain the layout information of the input document. Then a new regression module is applied to predict the global and local UV maps. Finally, we design an effective merging algorithm to correct the global prediction with local details. Both quantitative and qualitative experimental results demonstrate that our framework achieves favorable performance against state-of-the-art methods. In addition, the current publicly available document flattening datasets have limited 3D paper shapes without layout annotation and also lack a general geometric correction metric. Therefore, we build a new large-scale synthetic dataset by utilizing a fully automatic rendering method to generate deformed documents with diverse shapes and exact layout segmentation labels. We also propose a new geometric correction metric based on our paired document UV maps. Code and dataset will be released at https://github.com/BunnySoCrazy/LA-DocFlatten .","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"58 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135875325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on this objective function, we propose two frameworks for semantic appearance transfer: “Splice”, which trains a generator on a single, arbitrary pair of structure-appearance images, and “SpliceNet”, a feed-forward, real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available on our project page: splice-vit.github.io.
{"title":"Disentangling Structure and Appearance in ViT Feature Space","authors":"Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel","doi":"10.1145/3630096","DOIUrl":"https://doi.org/10.1145/3630096","url":null,"abstract":"We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer – “Splice”, which works by training a generator on a single and arbitrary pair of structure-appearance images, and “SpliceNet”, a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain . Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"126 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135372875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jose Luis Ponton, Haoran Yun, Andreas Aristidou, Carlos Andujar, Nuria Pelechano
Accurate and reliable human motion reconstruction is crucial for creating natural interactions of full-body avatars in Virtual Reality (VR) and entertainment applications. As the Metaverse and social applications gain popularity, users are seeking cost-effective solutions to create full-body animations that are comparable in quality to those produced by commercial motion capture systems. To provide affordable solutions, however, it is important to minimize the number of sensors attached to the subject’s body. Unfortunately, reconstructing the full-body pose from sparse data is a heavily under-determined problem. Studies that use IMU sensors face challenges in reconstructing the pose due to positional drift and pose ambiguity. In recent years, some mainstream VR systems have released 6-degree-of-freedom (6-DoF) tracking devices providing positional and rotational information. Nevertheless, most solutions for reconstructing full-body poses rely on traditional inverse kinematics (IK) solvers, which often produce discontinuous and unnatural poses. In this article, we introduce SparsePoser, a novel deep learning-based solution for reconstructing a full-body pose from a reduced set of six tracking devices. Our system incorporates a convolutional autoencoder that synthesizes high-quality, continuous human poses by learning the human motion manifold from motion capture data. We then employ a learned IK component, made of multiple lightweight feed-forward neural networks, to adjust the hands and feet toward the corresponding trackers. We extensively evaluate our method on publicly available motion capture datasets and with real-time live demos. We show that our method outperforms state-of-the-art techniques using IMU sensors or 6-DoF tracking devices, and that it can be used by users with different body dimensions and proportions.
{"title":"SparsePoser: Real-time Full-body Motion Reconstruction from Sparse Data","authors":"Jose Luis Ponton, Haoran Yun, Andreas Aristidou, Carlos Andujar, Nuria Pelechano","doi":"10.1145/3625264","DOIUrl":"https://doi.org/10.1145/3625264","url":null,"abstract":"Accurate and reliable human motion reconstruction is crucial for creating natural interactions of full-body avatars in Virtual Reality (VR) and entertainment applications. As the Metaverse and social applications gain popularity, users are seeking cost-effective solutions to create full-body animations that are comparable in quality to those produced by commercial motion capture systems. In order to provide affordable solutions though, it is important to minimize the number of sensors attached to the subject’s body. Unfortunately, reconstructing the full-body pose from sparse data is a heavily under-determined problem. Some studies that use IMU sensors face challenges in reconstructing the pose due to positional drift and ambiguity of the poses. In recent years, some mainstream VR systems have released 6-degree-of-freedom (6-DoF) tracking devices providing positional and rotational information. Nevertheless, most solutions for reconstructing full-body poses rely on traditional inverse kinematics (IK) solutions, which often produce non-continuous and unnatural poses. In this article, we introduce SparsePoser, a novel deep learning-based solution for reconstructing a full-body pose from a reduced set of six tracking devices. Our system incorporates a convolutional-based autoencoder that synthesizes high-quality continuous human poses by learning the human motion manifold from motion capture data. Then, we employ a learned IK component, made of multiple lightweight feed-forward neural networks, to adjust the hands and feet toward the corresponding trackers. We extensively evaluate our method on publicly available motion capture datasets and with real-time live demos. We show that our method outperforms state-of-the-art techniques using IMU sensors or 6-DoF tracking devices, and can be used for users with different body dimensions and proportions.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"198 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135765706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}