Dance2MIDI: Dance-driven multi-instrument music generation
Pub Date: 2024-07-24 | DOI: 10.1007/s41095-024-0417-1
Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han
Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario is under-explored. The challenges associated with dance-driven multi-instrument music (MIDI) generation are twofold: (i) lack of a publicly available multi-instrument MIDI and video paired dataset and (ii) the weak correlation between music and video. To tackle these challenges, we have built the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Firstly, to capture the relationship between dance and music, we employ a graph convolutional network to encode the dance motion. This allows us to extract features related to dance movement and dance style. Secondly, to generate a harmonious rhythm, we utilize a transformer model to decode the drum track sequence, leveraging a cross-attention mechanism. Thirdly, we model the task of generating the remaining tracks based on the drum track as a sequence understanding and completion task. A BERT-like model is employed to comprehend the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
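The abstract only names the components; as a rough illustration of the cross-attention step described above, the PyTorch sketch below decodes a drum-token sequence while attending to per-frame motion features produced by a (hypothetical) GCN motion encoder. All class names, dimensions, and the vocabulary size are assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation): a transformer decoder that
# autoregressively predicts drum-track tokens while cross-attending to dance
# motion features. Dimensions and the motion encoder are placeholders.
import torch
import torch.nn as nn

class DrumTrackDecoder(nn.Module):
    def __init__(self, vocab_size=512, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, drum_tokens, motion_feats):
        # drum_tokens: (B, T) int64 token ids; motion_feats: (B, F, d_model)
        x = self.token_emb(drum_tokens)   # positional encoding omitted for brevity
        T = x.size(1)
        # Causal mask: each drum token attends only to earlier tokens;
        # cross-attention to the motion features is unrestricted.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(tgt=x, memory=motion_feats, tgt_mask=causal)
        return self.head(h)               # (B, T, vocab_size) logits

# toy usage
decoder = DrumTrackDecoder()
tokens = torch.randint(0, 512, (2, 32))   # toy drum-event tokens
motion = torch.randn(2, 60, 256)          # e.g., 60 pose frames from a GCN encoder
logits = decoder(tokens, motion)
```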
Continual few-shot patch-based learning for anime-style colorization
Pub Date: 2024-07-09 | DOI: 10.1007/s41095-024-0414-4
Akinobu Maejima, Seitaro Shinagawa, Hiroyuki Kubo, Takuya Funatomi, Tatsuo Yotsukura, Satoshi Nakamura, Yasuhiro Mukaigawa
The automatic colorization of anime line drawings is a challenging problem in production pipelines. Recent advances in deep neural networks have addressed this problem; however, the need to collect many images of colorization targets from a new anime work before colorization starts creates a chicken-and-egg problem and has become an obstacle to using these networks in production pipelines. To overcome this obstacle, we propose a new patch-based learning method for few-shot anime-style colorization. The learning method adopts an efficient patch sampling technique with position embedding suited to the characteristics of anime line drawings. We also present a continual learning strategy that continuously updates our colorization model using new samples colorized by human artists. The advantage of our method is that it can learn our colorization model from scratch or from pre-trained weights using only a few pre- and post-colorized line drawings that artists create in their usual colorization work. Therefore, our method can be easily incorporated within existing production pipelines. We quantitatively demonstrate that our colorization method outperforms state-of-the-art methods.
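As a loose illustration of patch sampling with position embedding (the paper's exact scheme is not given in the abstract), the sketch below crops random patches from a line-drawing/color pair and appends normalized (x, y) coordinate channels to each patch. Function names, patch size, and channel layout are assumptions.

```python
# Illustrative sketch only (assumed details, not the paper's code): sample
# training patches from a line-drawing/color image pair and append normalized
# (x, y) position channels, so the model can exploit location cues typical of
# anime layouts.
import numpy as np

def sample_patches_with_position(line_img, color_img, patch=64, n=16, rng=None):
    # line_img: (H, W, 1), color_img: (H, W, 3), both float arrays in [0, 1]
    rng = rng or np.random.default_rng()
    H, W = line_img.shape[:2]
    inputs, targets = [], []
    for _ in range(n):
        y = rng.integers(0, H - patch + 1)
        x = rng.integers(0, W - patch + 1)
        # Normalized coordinate grids for this patch (the position embedding).
        ys = (np.arange(y, y + patch) / H).reshape(-1, 1).repeat(patch, axis=1)
        xs = (np.arange(x, x + patch) / W).reshape(1, -1).repeat(patch, axis=0)
        pos = np.stack([ys, xs], axis=-1)                   # (patch, patch, 2)
        lp = line_img[y:y + patch, x:x + patch]             # (patch, patch, 1)
        inputs.append(np.concatenate([lp, pos], axis=-1))   # (patch, patch, 3)
        targets.append(color_img[y:y + patch, x:x + patch])
    return np.stack(inputs), np.stack(targets)

# toy usage
line = np.random.rand(256, 256, 1).astype(np.float32)
color = np.random.rand(256, 256, 3).astype(np.float32)
X, Y = sample_patches_with_position(line, color)
```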
Recent advances in 3D Gaussian splatting
Pub Date: 2024-07-08 | DOI: 10.1007/s41095-024-0436-y
Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, Lin Gao
The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering in novel view synthesis. Unlike neural implicit representations such as neural radiance fields (NeRFs), which represent a 3D scene with position- and viewpoint-conditioned neural networks, 3D Gaussian splatting models the scene with a set of Gaussian ellipsoids, so that efficient rendering can be accomplished by rasterizing the ellipsoids into images. Apart from fast rendering, the explicit representation of 3D Gaussian splatting also facilitates downstream tasks such as dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of this technique. This survey aims to help beginners get started quickly in this field, to provide experienced researchers with a comprehensive overview, and to stimulate future development of the 3D Gaussian splatting representation.
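To make the rendering formulation mentioned above concrete, the sketch below implements the standard front-to-back alpha compositing used by 3DGS-style rasterizers, C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j). Projecting the Gaussians and evaluating their 2D footprints is omitted; the per-pixel alphas are assumed to already include the Gaussian falloff at this pixel.

```python
# Per-pixel compositing sketch for Gaussian splatting-style rasterization:
# Gaussians covering a pixel are sorted by depth and blended front to back.
import numpy as np

def composite_pixel(colors, alphas, depths, eps=1e-4):
    # colors: (N, 3), alphas: (N,), depths: (N,) for Gaussians covering one pixel
    order = np.argsort(depths)        # nearest first
    C = np.zeros(3)
    T = 1.0                           # accumulated transmittance
    for i in order:
        C += T * alphas[i] * colors[i]
        T *= 1.0 - alphas[i]
        if T < eps:                   # early termination once the pixel is opaque
            break
    return C

# toy usage: a red splat in front of a green one
print(composite_pixel(np.array([[1.0, 0, 0], [0, 1.0, 0]]),
                      np.array([0.6, 0.9]),
                      np.array([0.5, 2.0])))
```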
Illuminator: Image-based illumination editing for indoor scene harmonization
Pub Date: 2024-07-05 | DOI: 10.1007/s41095-023-0397-6
Zhongyun Bao, Gang Fu, Zipei Chen, Chunxia Xiao
Illumination harmonization is an important but challenging task that aims to achieve illumination compatibility between the foreground and background under different illumination conditions. Most current studies focus on seamlessly integrating the appearance (illumination or visual style) of the foreground object with the background scene, or on producing the foreground shadow; they rarely consider global illumination consistency (i.e., the illumination and shadow of the foreground object). In this work, we introduce "Illuminator", an image-based illumination editing technique that aims for more realistic global illumination harmonization, ensuring consistent illumination and plausible shadows in complex indoor environments. Illuminator contains a shadow residual generation branch and an object illumination transfer branch. The shadow residual generation branch introduces a novel attention-aware graph convolutional mechanism to generate reasonable foreground shadows. The object illumination transfer branch primarily transfers background illumination to the foreground region. In addition, we construct a real-world indoor illumination harmonization dataset, RIH, which consists of various foreground objects and background scenes captured under diverse illumination conditions, for training and evaluating Illuminator. Comprehensive experiments on the RIH dataset and a collection of real-world everyday photos validate the effectiveness of our method.
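The abstract only names the two branches; as a loose sketch of how such a two-branch design could be composed (an assumption for illustration, not the paper's actual pipeline), one branch re-lights the foreground to match the background illumination while the other predicts a shadow residual that darkens the background around the object.

```python
# Assumed composition of a two-branch harmonization model; both branches would
# be learned networks in practice.
import torch

def harmonize(fg, bg, mask, transfer_branch, shadow_branch):
    # fg, bg: (B, 3, H, W) foreground/background images in [0, 1]
    # mask:   (B, 1, H, W) foreground alpha matte
    relit_fg = transfer_branch(fg, bg, mask)              # match background illumination
    shadow_residual = shadow_branch(relit_fg, bg, mask)   # mostly negative: darkens bg
    composite = mask * relit_fg + (1 - mask) * (bg + shadow_residual)
    return composite.clamp(0, 1)

# toy stand-ins so the sketch runs
identity_transfer = lambda fg, bg, m: fg
zero_shadow = lambda fg, bg, m: torch.zeros_like(bg)
out = harmonize(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                torch.ones(1, 1, 64, 64), identity_transfer, zero_shadow)
```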
Shell stand: Stable thin shell models for 3D fabrication
Pub Date: 2024-06-24 | DOI: 10.1007/s41095-024-0402-8
Yu Xing, Xiaoxuan Wang, Lin Lu, Andrei Sharf, Daniel Cohen-Or, Changhe Tu
A thin shell model is a surface or structure whose thickness is considered negligible. In the context of 3D printing, thin shell models are characterized by lightweight, hollow structures and reduced material usage. Their versatility and visual appeal make them popular in various applications, such as cloth simulation, character skinning, and thin-walled structures like leaves, paper, or metal sheets. Nevertheless, optimizing thin shell models to stand without external support remains a challenge because of their minimal interior operational space; for the same reason, hollowing methods are also unsuitable for this task. Moreover, thin shell modulation methods must preserve the visual appearance of a two-sided surface, which further constrains the problem space. In this paper, we introduce a new visual disparity metric tailored for shell models that integrates local details and global shape attributes in terms of visual perception. Our method modulates thin shell models using global deformations and local thickening while accounting for visual saliency, stability, and structural integrity. Thereby, thin shell models such as bas-reliefs, hollow shapes, and cloth can be stabilized to stand in arbitrary orientations, making them ideal for 3D printing.
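For context, the basic static-stability criterion behind making a model "stand" is that the projection of the center of mass onto the ground plane must fall inside the convex hull of the contact points. The sketch below checks only this generic condition (using SciPy); it is not the paper's full optimization, which also modulates shape and thickness under the visual disparity metric.

```python
# Generic stability test: does the center of mass project into the support region?
import numpy as np
from scipy.spatial import Delaunay

def is_statically_stable(vertices, masses, ground_z=0.0, contact_tol=1e-3):
    # vertices: (N, 3) sample points of the shell, masses: (N,) per-point mass
    com = (vertices * masses[:, None]).sum(0) / masses.sum()
    contacts = vertices[vertices[:, 2] <= ground_z + contact_tol][:, :2]
    if len(contacts) < 3:
        return False                               # no support polygon
    hull = Delaunay(contacts)                      # triangulation of the support region
    return bool(hull.find_simplex(com[:2]) >= 0)   # COM projects inside the support

# toy usage: a tripod-like set of points
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.3, 0.3, 1.0]])
print(is_statically_stable(pts, np.ones(len(pts))))
```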
Temporal vectorized visibility for direct illumination of animated models
Pub Date: 2024-05-29 | DOI: 10.1007/s41095-023-0339-3
Zhenni Wang, Tze Yui Ho, Yi Xiao, Chi Sing Leung
Direct illumination rendering is an important technique in computer graphics. Precomputed radiance transfer algorithms can provide high-quality rendering results in real time, but they can only support rigid models. Ray tracing algorithms, on the other hand, are flexible and can gracefully handle animated models. With NVIDIA RTX and the AI denoiser, ray tracing algorithms can render visually appealing results in real time; although visually appealing, these results can deviate considerably from the reference. We propose a visibility-boundary-edge-oriented infinite-triangle bounding volume hierarchy (BVH) traversal algorithm that dynamically generates visibility in vector form. Our algorithm exploits the properties of visibility-boundary edges and infinite-triangle BVH traversal to maximize the efficiency of vector-form visibility generation. We also propose a novel data structure, temporal vectorized visibility, which allows visibility in vector form to be shared across time and further increases generation efficiency. Our algorithm can efficiently render close-to-reference direct illumination results. With similar processing time, it provides a visual quality improvement of around 10 dB in peak signal-to-noise ratio (PSNR) over the ray tracing algorithm ReSTIR (reservoir-based spatiotemporal importance resampling).
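For reference, the PSNR figure quoted above follows the standard definition below (this is the common metric, not code from the paper); a gain of about 10 dB corresponds to roughly a 10x reduction in mean squared error against the reference render.

```python
# Standard PSNR between a rendered image and a reference image.
import numpy as np

def psnr(img, ref, peak=1.0):
    # img, ref: float arrays with values in [0, peak]
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# toy usage: compare a render against a slightly noisy copy of itself
a = np.clip(np.random.rand(64, 64, 3), 0, 1)
print(psnr(np.clip(a + 0.01 * np.random.randn(*a.shape), 0, 1), a))
```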
Super-resolution reconstruction of single image for latent features
Pub Date: 2024-05-24 | DOI: 10.1007/s41095-023-0387-8
Xin Wang, Jing-Ke Yan, Jing-Ye Cai, Jian-Hua Deng, Qin Qin, Yao Cheng
Single-image super-resolution (SISR) typically focuses on restoring various degraded low-resolution (LR) images to a single high-resolution (HR) image. However, during SISR tasks, it is often challenging for models to simultaneously maintain high quality and rapid sampling while preserving diversity in details and texture features. This challenge can lead to issues such as model collapse, a lack of rich details and texture features in the reconstructed HR images, and excessive time consumption for model sampling. To address these problems, this paper proposes a Latent Feature-oriented Diffusion Probability Model (LDDPM). First, we design a conditional encoder that effectively encodes LR images, reducing the solution space for image reconstruction and thereby improving the quality of the reconstructed images. We then employ a normalizing flow and multimodal adversarial training, learning from complex multimodal distributions, to model the denoising distribution; doing so boosts generative modeling capability within a minimal number of sampling steps. Experimental comparisons of our proposed model with existing SISR methods on mainstream datasets demonstrate that our model reconstructs more realistic HR images and achieves better performance on multiple evaluation metrics, providing a fresh perspective for tackling SISR tasks.
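To illustrate how an LR-image encoding can condition a diffusion sampler, the sketch below implements one standard conditional DDPM reverse step (the textbook formulation, not the paper's exact sampler); the noise-prediction network `eps_model` and its conditioning interface are assumptions.

```python
# One reverse (denoising) step of a conditional DDPM:
# x_{t-1} = (x_t - (1 - a_t)/sqrt(1 - abar_t) * eps) / sqrt(a_t) + sigma_t * z
import torch

def ddpm_reverse_step(x_t, t, cond, eps_model, betas):
    # x_t: (B, C, H, W) noisy latent at step t; cond: LR-image features
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps = eps_model(x_t, t, cond)                        # predicted noise
    coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise           # sigma_t^2 = beta_t

# toy usage with a dummy noise predictor
betas = torch.linspace(1e-4, 0.02, 1000)
dummy_eps = lambda x, t, c: torch.zeros_like(x)
x = torch.randn(1, 3, 32, 32)
x_prev = ddpm_reverse_step(x, 999, None, dummy_eps, betas)
```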