
Latest publications in Computational Visual Media

Real-time distance field acceleration based free-viewpoint video synthesis for large sports fields
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-022-0323-3
Yanran Dai, Jing Li, Yuqi Jiang, Haidong Qin, Bang Liang, Shikuan Hong, Haozhe Pan, Tao Yang

Free-viewpoint video allows the user to view objects from any virtual perspective, creating an immersive visual experience. This technology enhances the interactivity and freedom of multimedia performances. However, many free-viewpoint video synthesis methods hardly satisfy the requirement to work in real time with high precision, particularly for sports fields having large areas and numerous moving objects. To address these issues, we propose a free-viewpoint video synthesis method based on distance field acceleration. The central idea is to fuse multi-view distance field information and use it to adjust the search step size adaptively. Adaptive step size search is used in two ways: for fast estimation of multi-object three-dimensional surfaces, and synthetic view rendering based on global occlusion judgement. We have implemented our ideas using parallel computing for interactive display, using CUDA and OpenGL frameworks, and have used real-world and simulated experimental datasets for evaluation. The results show that the proposed method can render free-viewpoint videos with multiple objects on large sports fields at 25 fps. Furthermore, the visual quality of our synthetic novel viewpoint images exceeds that of state-of-the-art neural-rendering-based methods.
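The method itself is implemented in CUDA; as a purely illustrative, minimal NumPy sketch of the central idea, the ray search below adapts its step size to a distance field, so steps are large in empty space and shrink near object surfaces. The fused_distance callable stands in for the multi-view fused distance field, and the toy sphere SDF is an assumption for demonstration only.

```python
import numpy as np

def sphere_trace(origin, direction, fused_distance,
                 t_max=50.0, eps=1e-3, max_steps=256):
    """March one ray, adapting the step size to the distance field.

    origin, direction : (3,) arrays; direction is assumed normalized.
    fused_distance    : callable p -> conservative distance to the
                        nearest object surface (stand-in for the
                        multi-view fused distance field).
    Returns the hit position, or None if the ray escapes.
    """
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = fused_distance(p)
        if d < eps:          # close enough: treat as a surface hit
            return p
        t += d               # adaptive step: free to jump by d
        if t > t_max:
            break
    return None

# Toy distance field: a single sphere of radius 1 centered at z = 5.
center = np.array([0.0, 0.0, 5.0])
toy_sdf = lambda p: np.linalg.norm(p - center) - 1.0

hit = sphere_trace(np.zeros(3), np.array([0.0, 0.0, 1.0]), toy_sdf)
print(hit)   # approximately [0, 0, 4]: the front of the sphere
```

In the paper this search runs in parallel over many rays and also drives the global occlusion judgement during view rendering; the sketch shows only the per-ray adaptive stepping.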

Citations: 0
Multi-modal visual tracking: Review and experimental comparison
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-023-0345-5
Pengyu Zhang, Dong Wang, Huchuan Lu

Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.
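As a concrete point of reference for the fusion designs surveyed here, the snippet below sketches the simplest late-fusion scheme for RGB-T tracking: per-modality response maps are blended with a weight, and the peak of the fused map gives the target location. This is an illustrative baseline, not any specific tracker from the review; many published trackers predict the fusion weight per frame or fuse at the feature level instead.

```python
import numpy as np

def fuse_response_maps(resp_rgb, resp_thermal, w_rgb=0.6):
    """Late fusion of two single-modality tracker response maps.

    resp_rgb, resp_thermal : (H, W) response maps from an RGB tracker
                             and a thermal tracker on the same frame.
    w_rgb                  : fusion weight (fixed here for simplicity).
    Returns the fused map and the predicted target location (x, y).
    """
    fused = w_rgb * resp_rgb + (1.0 - w_rgb) * resp_thermal
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (x, y)

# Toy example: the thermal map is confident at (x=40, y=30).
rgb = np.random.rand(64, 64) * 0.2
thermal = np.zeros((64, 64)); thermal[30, 40] = 1.0
_, loc = fuse_response_maps(rgb, thermal, w_rgb=0.3)
print(loc)   # (40, 30)
```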

Citations: 0
Controllable multi-domain semantic artwork synthesis
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-023-0356-2
Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui

We present a novel framework for the multi-domain synthesis of artworks from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset called ArtSem that contains 40,000 images of artwork from four different domains, with their corresponding semantic label maps. We first extracted semantic maps from landscape photography and used a conditional generative adversarial network (GAN)-based approach for generating high-quality artwork from semantic maps without requiring paired training data. Furthermore, we propose an artwork-synthesis model using domain-dependent variational encoders for high-quality multi-domain synthesis. Subsequently, the model was improved and complemented with a simple but effective normalization method based on jointly normalizing semantics and style, which we call spatially style-adaptive normalization (SSTAN). Compared to the previous methods, which only take semantic layout as the input, our model jointly learns style and semantic information representation, improving the generation quality of artistic images. These results indicate that our model learned to separate the domains in the latent space. Thus, we can perform fine-grained control of the synthesized artwork by identifying hyperplanes that separate the different domains. Moreover, by combining the proposed dataset and approach, we generated user-controllable artworks of higher quality than that of existing approaches, as corroborated by quantitative metrics and a user study.
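SSTAN is described here only at a high level; the PyTorch sketch below shows the general shape such a layer could take under that description: the feature map is instance-normalized, then modulated by per-pixel scale and shift maps predicted jointly from the semantic layout and a broadcast style code. Layer widths and the exact style-injection scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSTANBlock(nn.Module):
    """Sketch of a spatially style-adaptive normalization layer.

    Normalizes the feature map, then predicts per-pixel gamma/beta
    from the semantic label map concatenated with a broadcast style
    code, so both semantics and style drive the modulation.
    """
    def __init__(self, feat_ch, num_labels, style_dim, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(num_labels + style_dim, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, feat, seg_onehot, style):
        # feat: (B, C, H, W); seg_onehot: (B, L, Hs, Ws); style: (B, S)
        b, _, h, w = feat.shape
        seg = F.interpolate(seg_onehot, size=(h, w), mode='nearest')
        sty = style[:, :, None, None].expand(-1, -1, h, w)
        ctx = self.shared(torch.cat([seg, sty], dim=1))
        return self.norm(feat) * (1 + self.to_gamma(ctx)) + self.to_beta(ctx)

x = torch.randn(2, 64, 32, 32)
seg = torch.randn(2, 10, 256, 256).softmax(dim=1)
style = torch.randn(2, 16)
print(SSTANBlock(64, 10, 16)(x, seg, style).shape)  # torch.Size([2, 64, 32, 32])
```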

Citations: 0
Temporally consistent video colorization with deep feature propagation and self-regularization learning
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-023-0342-8
Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong

Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively less research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method can not only provide visually pleasing colorized video, but also with clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, while code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.
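The self-regularization idea, penalizing disagreement between colorizations of the same frames produced with different time steps, can be written very compactly. The sketch below is a hedged reading of that loss; the exact pairing and weighting used in TCVC may differ, and the released code linked above is authoritative.

```python
import torch
import torch.nn.functional as F

def self_regularization_loss(preds_step_a, preds_step_b):
    """Penalize disagreement between two colorizations of the same
    frames obtained with different temporal step sizes.

    preds_step_a, preds_step_b : (T, C, H, W) tensors of predicted
    color channels for the same T frames. No ground-truth color
    video is required, which is the point of the scheme.
    """
    return F.l1_loss(preds_step_a, preds_step_b)

# Toy usage: two passes over an 8-frame clip.
pa = torch.rand(8, 2, 64, 64)
pb = pa + 0.01 * torch.randn_like(pa)   # slightly different pass
print(self_regularization_loss(pa, pb).item())
```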

Citations: 0
Multi-granularity sequence generation for hierarchical image classification
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-022-0332-2
Xinda Liu, Lili Wang

Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embedding as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.
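To make the decoding step concrete, the PyTorch sketch below shows a label-sequence decoder of the kind described: label embeddings attend to earlier (coarser) labels through masked self-attention and to visual tokens through cross-attention, then predict the next-granularity label. Dimensions, depth, and vocabulary handling are illustrative assumptions, not the released MGSG code.

```python
import torch
import torch.nn as nn

class LabelSequenceDecoder(nn.Module):
    """Sketch of an MGSG-style decoder: label embeddings attend to
    earlier labels (masked self-attention) and to visual tokens
    (cross-attention), then predict the next-granularity label."""
    def __init__(self, vocab_size, d_model=256, nhead=8, depth=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, label_ids, visual_tokens):
        # label_ids: (B, L) coarse-to-fine labels; visual_tokens: (B, N, D)
        L = label_ids.size(1)
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
        x = self.embed(label_ids)
        x = self.decoder(x, visual_tokens, tgt_mask=causal)
        return self.head(x)    # (B, L, vocab): per-granularity logits

dec = LabelSequenceDecoder(vocab_size=500)
logits = dec(torch.randint(0, 500, (2, 3)), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 3, 500])
```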

Citations: 0
Generating diverse clothed 3D human animations via a generative model
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-022-0324-2
Min Shi, Wenke Feng, Lin Gao, Dengming Zhu

Data-driven garment animation is a current topic of interest in the computer graphics industry. Existing approaches generally establish the mapping between a single human pose or a temporal pose sequence, and garment deformation, but it is difficult to quickly generate diverse clothed human animations. We address this problem with a method to automatically synthesize dressed human animations with temporal consistency from a specified human motion label. At the heart of our method is a two-stage strategy. Specifically, we first learn a latent space encoding the sequence-level distribution of human motions utilizing a transformer-based conditional variational autoencoder (Transformer-CVAE). Then a garment simulator synthesizes dynamic garment shapes using a transformer encoder–decoder architecture. Since the learned latent space comes from varied human motions, our method can generate a variety of styles of motions given a specific motion label. By means of a novel beginning of sequence (BOS) learning strategy and a self-supervised refinement procedure, our garment simulator is capable of efficiently synthesizing garment deformation sequences corresponding to the generated human motions while maintaining temporal and spatial consistency. We verify our ideas experimentally. This is the first generative model that directly dresses human animation.
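The sampling path of such a label-conditioned generative model can be sketched briefly: draw a latent from the prior, combine it with a motion-label embedding, and decode a pose sequence. The sketch below swaps the paper's transformer decoder for a small GRU rollout purely to stay short; it is an illustrative assumption, not the Transformer-CVAE itself.

```python
import torch
import torch.nn as nn

class MotionCVAESampler(nn.Module):
    """Illustrative sampling path of a label-conditioned VAE for
    motion sequences: sample a latent from the prior, combine it
    with the motion-label embedding, and decode a pose sequence."""
    def __init__(self, num_labels, latent_dim=64, pose_dim=72, hidden=256):
        super().__init__()
        self.label_embed = nn.Embedding(num_labels, latent_dim)
        self.init_state = nn.Linear(2 * latent_dim, hidden)
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    @torch.no_grad()
    def sample(self, label_id, num_frames=60):
        z = torch.randn(1, self.label_embed.embedding_dim)     # prior sample
        cond = torch.cat([z, self.label_embed(label_id)], dim=-1)
        h = torch.tanh(self.init_state(cond)).unsqueeze(0)     # (1, 1, hidden)
        pose = torch.zeros(1, 1, self.to_pose.out_features)
        frames = []
        for _ in range(num_frames):                            # autoregressive rollout
            out, h = self.gru(pose, h)
            pose = self.to_pose(out)
            frames.append(pose)
        return torch.cat(frames, dim=1)                        # (1, T, pose_dim)

sampler = MotionCVAESampler(num_labels=10)
print(sampler.sample(torch.tensor([3])).shape)   # torch.Size([1, 60, 72])
```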

Citations: 0
Learning physically based material and lighting decompositions for face editing
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-01-03. DOI: 10.1007/s41095-022-0309-1
Qian Zhang, Vikas Thamizharasan, James Tompkin

Lighting is crucial for portrait photography, yet the complex interactions between the skin and incident light are expensive to model computationally in graphics and difficult to reconstruct analytically via computer vision. Alternatively, to allow fast and controllable reflectance and lighting editing, we developed a physically based decomposition through deep learned priors from path-traced portrait images. Previous approaches that used simplified material models or low-frequency or low-dynamic-range lighting struggled to model specular reflections or relight directly without intermediate decomposition. However, we estimate the surface normal, skin albedo and roughness, and high-frequency HDRI maps, and propose an architecture to estimate both diffuse and specular reflectance components. In our experiments, we show that this approach can represent the true appearance function more effectively than simpler baseline methods, leading to better generalization and higher-quality editing.
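To illustrate how the estimated maps recombine for relighting, the sketch below shades one pixel from a normal, albedo, and roughness estimate under a single light, using Lambertian diffuse plus a Blinn-Phong-style specular lobe as a stand-in for the paper's physically based reflectance model and HDRI lighting.

```python
import numpy as np

def relight_pixel(normal, albedo, roughness, light_dir, view_dir,
                  light_color=np.ones(3)):
    """Recombine per-pixel estimates into a shaded color.

    Lambertian diffuse plus a Blinn-Phong-style specular lobe whose
    exponent shrinks with roughness; a simplified stand-in that only
    shows how decomposed maps and a light direction come back
    together for controllable relighting.
    """
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / np.linalg.norm(l + v)

    diffuse = albedo * max(np.dot(n, l), 0.0)
    shininess = 2.0 / (roughness ** 2 + 1e-4)      # rougher -> broader lobe
    specular = max(np.dot(n, h), 0.0) ** shininess
    return light_color * (diffuse + specular)

color = relight_pixel(normal=np.array([0.0, 0.0, 1.0]),
                      albedo=np.array([0.8, 0.6, 0.5]),
                      roughness=0.4,
                      light_dir=np.array([0.3, 0.3, 1.0]),
                      view_dir=np.array([0.0, 0.0, 1.0]))
print(color)
```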

Citations: 0
APF-GAN: Exploring asymmetric pre-training and fine-tuning strategy for conditional generative adversarial network
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2023-11-30. DOI: 10.1007/s41095-023-0357-1
Yuxuan Li, Lingfeng Yang, Xiang Li
Citations: 0
Hierarchical vectorization for facial images
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2023-11-30. DOI: 10.1007/s41095-022-0314-4
Qian Fu, Linlin Liu, Fei Hou, Ying He

The explosive growth of social media means portrait editing and retouching are in high demand. While portraits are commonly captured and stored as raster images, editing raster images is non-trivial and requires the user to be highly skilled. Aiming at developing intuitive and easy-to-use portrait editing tools, we propose a novel vectorization method that can automatically convert raster images into a 3-tier hierarchical representation. The base layer consists of a set of sparse diffusion curves (DCs) which characterize salient geometric features and low-frequency colors, providing a means for semantic color transfer and facial expression editing. The middle level encodes specular highlights and shadows as large, editable Poisson regions (PRs) and allows the user to directly adjust illumination by tuning the strength and changing the shapes of PRs. The top level contains two types of pixel-sized PRs for high-frequency residuals and fine details such as pimples and pigmentation. We train a deep generative model that can produce high-frequency residuals automatically. Thanks to the inherent meaning in vector primitives, editing portraits becomes easy and intuitive. In particular, our method supports color transfer, facial expression editing, highlight and shadow editing, and automatic retouching. To quantitatively evaluate the results, we extend the commonly used FLIP metric (which measures color and feature differences between two images) to consider illumination. The new metric, illumination-sensitive FLIP, can effectively capture salient changes in color transfer results, and is more consistent with human perception than FLIP and other quality measures for portrait images. We evaluate our method on the FFHQR dataset and show it to be effective for common portrait editing tasks, such as retouching, light editing, color transfer, and expression editing.
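The three-tier representation is easy to picture as plain data structures: diffusion curves at the base, large editable Poisson regions for highlights and shadows in the middle, and pixel-sized regions for residual detail on top. The sketch below uses hypothetical field names chosen for illustration; the paper does not define this API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DiffusionCurve:
    """Base layer: a sparse curve with colors on each side that are
    diffused outward to reconstruct low-frequency shading."""
    control_points: List[Tuple[float, float]]
    left_colors: List[Tuple[float, float, float]]
    right_colors: List[Tuple[float, float, float]]

@dataclass
class PoissonRegion:
    """Middle layer: an editable region whose gradient field encodes a
    highlight or shadow; `strength` scales the illumination edit."""
    boundary: List[Tuple[float, float]]
    strength: float = 1.0

@dataclass
class FacePortraitVector:
    """Three-tier vectorization of a portrait (illustrative layout)."""
    base_curves: List[DiffusionCurve] = field(default_factory=list)
    highlight_shadow: List[PoissonRegion] = field(default_factory=list)
    detail_regions: List[PoissonRegion] = field(default_factory=list)  # pixel-sized PRs

    def brighten_highlights(self, factor: float) -> None:
        """Illumination edit: scale every highlight/shadow region."""
        for pr in self.highlight_shadow:
            pr.strength *= factor

portrait = FacePortraitVector(highlight_shadow=[PoissonRegion(boundary=[(0, 0), (10, 0), (10, 10)])])
portrait.brighten_highlights(1.5)
print(portrait.highlight_shadow[0].strength)   # 1.5
```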

Citations: 0
A unified multi-view multi-person tracking framework
IF 6.9, CAS Tier 3 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2023-11-30. DOI: 10.1007/s41095-023-0334-8
Fan Yang, Shigeyuki Odashima, Sosuke Yamao, Hiroaki Fujimoto, Shoichi Masui, Shan Jiang

Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks separately target footprint tracking, or pose tracking. Frameworks designed for the former cannot be used for the latter, because they directly obtain 3D positions on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. In contrast, frameworks designed for pose tracking generally isolate multi-view and multi-frame associations and may not be sufficiently robust for footprint tracking, which utilizes fewer key points than pose tracking, weakening multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework to bridge the gap between footprint tracking and pose tracking. Without additional modifications, the framework can adopt monocular 2D bounding boxes and 2D poses as its input to produce robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are jointly employed to improve association and triangulation. Our framework is shown to provide state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
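The geometric core of lifting multi-view associations into 3D trajectories is triangulation from calibrated views; the sketch below shows standard linear (DLT) triangulation of one point from 2D detections in several cameras. The framework's joint multi-frame, multi-view association and its robust triangulation are more involved than this illustration.

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D point.

    points_2d : list of (x, y) detections of the same joint/footprint
                in several views.
    proj_mats : list of 3x4 camera projection matrices for those views.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                      # null-space direction of A
    return X[:3] / X[3]

# Toy check with two axis-aligned cameras observing the point (1, 2, 10).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-2.0], [0.0], [0.0]])])  # shifted 2 units along x
X_true = np.array([1.0, 2.0, 10.0, 1.0])
pts = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(triangulate_dlt(pts, [P1, P2]))   # approximately [1, 2, 10]
```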

Citations: 0