
Latest Publications in Computational Visual Media

Deep panoramic depth prediction and completion for indoor scenes
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-02-08 | DOI: 10.1007/s41095-023-0358-0
Giovanni Pintore, Eva Almansa, Armando Sanchez, Giorgio Vassena, Enrico Gobbetti

We introduce a novel end-to-end deep-learning solution for rapidly estimating a dense spherical depth map of an indoor environment. Our input is a single equirectangular image registered with a sparse depth map, as provided by a variety of common capture setups. Depth is inferred by an efficient and lightweight single-branch network, which employs a dynamic gating system to process together dense visual data and sparse geometric data. We exploit the characteristics of typical man-made environments to efficiently compress multi-resolution features and find short- and long-range relations among scene parts. Furthermore, we introduce a new augmentation strategy to make the model robust to different types of sparsity, including those generated by various structured light sensors and LiDAR setups. The experimental results demonstrate that our method provides interactive performance and outperforms state-of-the-art solutions in computational efficiency, adaptivity to variable depth sparsity patterns, and prediction accuracy for challenging indoor data, even when trained solely on synthetic data without any fine tuning.
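To make the dynamic gating idea concrete, here is a minimal PyTorch sketch of a block that fuses dense RGB features with sparse depth features through a learned per-pixel gate. The module name, channel sizes, and the sigmoid-gated blend are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class GatedRGBDepthFusion(nn.Module):
    """Toy dynamic-gating block: a learned gate decides, per pixel,
    how much to trust sparse depth features versus dense RGB features."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, channels, 3, padding=1)    # dense visual branch
        self.depth_enc = nn.Conv2d(2, channels, 3, padding=1)  # sparse depth + validity mask
        self.gate = nn.Conv2d(2 * channels, channels, 1)       # predicts per-pixel mixing weights

    def forward(self, rgb, sparse_depth, valid_mask):
        f_rgb = torch.relu(self.rgb_enc(rgb))
        f_dep = torch.relu(self.depth_enc(torch.cat([sparse_depth, valid_mask], dim=1)))
        g = torch.sigmoid(self.gate(torch.cat([f_rgb, f_dep], dim=1)))
        return g * f_dep + (1.0 - g) * f_rgb   # gated blend of the two modalities

if __name__ == "__main__":
    rgb = torch.rand(1, 3, 256, 512)           # equirectangular image
    depth = torch.zeros(1, 1, 256, 512)        # mostly-empty sparse depth
    mask = (depth > 0).float()                 # validity mask for the sparse samples
    fused = GatedRGBDepthFusion()(rgb, depth, mask)
    print(fused.shape)                         # torch.Size([1, 32, 256, 512])
```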

Citations: 0
Shape embedding and retrieval in multi-flow deformation
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-02-08 | DOI: 10.1007/s41095-022-0315-3
Baiqiang Leng, Jingwei Huang, Guanlin Shen, Bin Wang

We propose a unified 3D flow framework for joint learning of shape embedding and deformation for different categories. Our goal is to recover shapes from imperfect point clouds by fitting the best shape template in a shape repository after deformation. Accordingly, we learn a shape embedding for template retrieval and a flow-based network for robust deformation. We note that the deformation flow can be quite different for different shape categories. Therefore, we introduce a novel multi-hub module to learn multiple modes of deformation to incorporate such variation, providing a network which can handle a wide range of objects from different categories. The shape embedding is designed to retrieve the best-fit template as the nearest neighbor in a latent space. We replace the standard fully connected layer with a tiny structure in the embedding that significantly reduces network complexity and further improves deformation quality. Experiments show the superiority of our method to existing state-of-the-art methods via qualitative and quantitative comparisons. Finally, our method provides efficient and flexible deformation that can further be used for novel shape design.
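The retrieval step can be pictured as a nearest-neighbour lookup in the learned latent space. The sketch below assumes precomputed template embeddings and cosine similarity; both are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def retrieve_best_template(query_embedding: np.ndarray,
                           template_embeddings: np.ndarray) -> int:
    """Return the index of the template whose latent embedding is the
    nearest neighbour (by cosine similarity) to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    t = template_embeddings / np.linalg.norm(template_embeddings, axis=1, keepdims=True)
    return int(np.argmax(t @ q))

# Toy usage: 100 templates in a 64-D latent space, one query shape.
rng = np.random.default_rng(0)
templates = rng.normal(size=(100, 64))
query = rng.normal(size=64)
print("best-fit template:", retrieve_best_template(query, templates))
```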

Citations: 0
Dynamic ocean inverse modeling based on differentiable rendering
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-023-0338-4
Xueguang Xie, Yang Gao, Fei Hou, Aimin Hao, Hong Qin

Learning and inferring underlying motion patterns of captured 2D scenes and then re-creating dynamic evolution consistent with the real-world natural phenomena have high appeal for graphics and animation. To bridge the technical gap between virtual and real environments, we focus on the inverse modeling and reconstruction of visually consistent and property-verifiable oceans, taking advantage of deep learning and differentiable physics to learn geometry and constitute waves in a self-supervised manner. First, we infer hierarchical geometry using two networks, which are optimized via the differentiable renderer. We extract wave components from the sequence of inferred geometry through a network equipped with a differentiable ocean model. Then, ocean dynamics can be evolved using the reconstructed wave components. Through extensive experiments, we verify that our new method yields satisfactory results for both geometry reconstruction and wave estimation. Moreover, the new framework has the inverse modeling potential to facilitate a host of graphics applications, such as the rapid production of physically accurate scene animation and editing guided by real ocean scenes.
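As a rough illustration of evolving ocean dynamics from recovered wave components, the sketch below sums sinusoidal components into a time-varying height field. The linear deep-water dispersion relation and the parameter names are textbook assumptions, not the paper's learned differentiable ocean model.

```python
import numpy as np

def evolve_height_field(amplitudes, wave_vectors, phases, grid, t, g=9.81):
    """Sum sinusoidal wave components into a height field at time t,
    using the deep-water dispersion relation omega = sqrt(g * |k|)."""
    x, y = grid                                # (H, W) meshgrids in metres
    h = np.zeros_like(x)
    for a, k, phi in zip(amplitudes, wave_vectors, phases):
        k = np.asarray(k, dtype=float)
        omega = np.sqrt(g * np.linalg.norm(k)) # temporal frequency of this component
        h += a * np.cos(k[0] * x + k[1] * y - omega * t + phi)
    return h

# Toy usage: two wave components on a 64x64 patch of ocean surface.
xs = np.linspace(0, 50, 64)
grid = np.meshgrid(xs, xs)
heights = evolve_height_field(
    amplitudes=[0.5, 0.2],
    wave_vectors=[(0.3, 0.0), (0.1, 0.25)],
    phases=[0.0, 1.2],
    grid=grid,
    t=2.0,
)
print(heights.shape)  # (64, 64)
```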

Citations: 0
Benchmarking visual SLAM methods in mirror environments
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-022-0329-x
Peter Herbert, Jing Wu, Ze Ji, Yu-Kun Lai

Visual simultaneous localisation and mapping (vSLAM) finds applications for indoor and outdoor navigation that routinely subjects it to visual complexities, particularly mirror reflections. The effect of mirror presence (time visible and its average size in the frame) was hypothesised to impact localisation and mapping performance, with systems using direct techniques expected to perform worse. Thus, a dataset, MirrEnv, of image sequences recorded in mirror environments, was collected, and used to evaluate the performance of existing representative methods. RGBD ORB-SLAM3 and BundleFusion appear to show moderate degradation of absolute trajectory error with increasing mirror duration, whilst the remaining results did not show significantly degraded localisation performance. The mesh maps generated proved to be very inaccurate, with real and virtual reflections colliding in the reconstructions. A discussion is given of the likely sources of error and robustness in mirror environments, outlining future directions for validating and improving vSLAM performance in the presence of planar mirrors. The MirrEnv dataset is available at https://doi.org/10.17035/d.2023.0292477898.
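The absolute trajectory error used to quantify this degradation is, in its simplest form, an RMSE over time-associated camera positions. The sketch below omits the usual SE(3)/Sim(3) alignment step for brevity and is not the benchmark's exact evaluation protocol.

```python
import numpy as np

def absolute_trajectory_error(estimated: np.ndarray, ground_truth: np.ndarray) -> float:
    """RMSE of per-frame position differences between an estimated and a
    ground-truth trajectory (both of shape (N, 3)), assuming the two are
    already time-associated and expressed in the same frame."""
    diff = estimated - ground_truth
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Toy usage: a noisy copy of a straight-line trajectory.
gt = np.stack([np.linspace(0, 10, 100), np.zeros(100), np.zeros(100)], axis=1)
est = gt + np.random.default_rng(1).normal(scale=0.05, size=gt.shape)
print(f"ATE RMSE: {absolute_trajectory_error(est, gt):.3f} m")
```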

Citations: 0
Real-time distance field acceleration based free-viewpoint video synthesis for large sports fields
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-022-0323-3
Yanran Dai, Jing Li, Yuqi Jiang, Haidong Qin, Bang Liang, Shikuan Hong, Haozhe Pan, Tao Yang

Free-viewpoint video allows the user to view objects from any virtual perspective, creating an immersive visual experience. This technology enhances the interactivity and freedom of multimedia performances. However, many free-viewpoint video synthesis methods hardly satisfy the requirement to work in real time with high precision, particularly for sports fields having large areas and numerous moving objects. To address these issues, we propose a free-viewpoint video synthesis method based on distance field acceleration. The central idea is to fuse multi-view distance field information and use it to adjust the search step size adaptively. Adaptive step size search is used in two ways: for fast estimation of multi-object three-dimensional surfaces, and synthetic view rendering based on global occlusion judgement. We have implemented our ideas using parallel computing for interactive display, using CUDA and OpenGL frameworks, and have used real-world and simulated experimental datasets for evaluation. The results show that the proposed method can render free-viewpoint videos with multiple objects on large sports fields at 25 fps. Furthermore, the visual quality of our synthetic novel viewpoint images exceeds that of state-of-the-art neural-rendering-based methods.
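The adaptive step-size search is closely related to classic sphere tracing against a distance field: each step advances by the queried distance, so empty space is crossed in large steps while surfaces are approached in progressively smaller ones. In the sketch below, scene_distance is a hypothetical stand-in (a single sphere) for the fused multi-view distance field.

```python
import numpy as np

def scene_distance(p: np.ndarray) -> float:
    """Hypothetical stand-in for the fused distance field:
    a sphere of radius 1 centred at (0, 0, 5)."""
    return float(np.linalg.norm(p - np.array([0.0, 0.0, 5.0])) - 1.0)

def adaptive_ray_march(origin, direction, max_steps=128, hit_eps=1e-3, t_max=100.0):
    """March along a ray, stepping by the distance-field value each time."""
    t = 0.0
    for _ in range(max_steps):
        d = scene_distance(origin + t * direction)
        if d < hit_eps:          # close enough to the surface: report a hit
            return t
        t += d                   # adaptive step: large when far from geometry
        if t > t_max:
            break
    return None                  # no surface found along this ray

hit = adaptive_ray_march(np.zeros(3), np.array([0.0, 0.0, 1.0]))
print("hit distance:", hit)      # 4.0 for the toy sphere
```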

Citations: 0
Multi-modal visual tracking: Review and experimental comparison
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-023-0345-5
Pengyu Zhang, Dong Wang, Huchuan Lu

Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.

Citations: 0
Controllable multi-domain semantic artwork synthesis
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-023-0356-2
Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui

We present a novel framework for the multi-domain synthesis of artworks from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset called ArtSem that contains 40,000 images of artwork from four different domains, with their corresponding semantic label maps. We first extracted semantic maps from landscape photography and used a conditional generative adversarial network (GAN)-based approach for generating high-quality artwork from semantic maps without requiring paired training data. Furthermore, we propose an artwork-synthesis model using domain-dependent variational encoders for high-quality multi-domain synthesis. Subsequently, the model was improved and complemented with a simple but effective normalization method based on jointly normalizing semantics and style, which we call spatially style-adaptive normalization (SSTAN). Compared to the previous methods, which only take semantic layout as the input, our model jointly learns style and semantic information representation, improving the generation quality of artistic images. These results indicate that our model learned to separate the domains in the latent space. Thus, we can perform fine-grained control of the synthesized artwork by identifying hyperplanes that separate the different domains. Moreover, by combining the proposed dataset and approach, we generated user-controllable artworks of higher quality than that of existing approaches, as corroborated by quantitative metrics and a user study.
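A spatially adaptive normalization layer in the spirit of the described SSTAN can be sketched as follows: features are normalized and then re-modulated by per-pixel scale and bias predicted from the semantic map concatenated with a broadcast style code. The layer sizes and the SPADE-like structure are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SpatiallyAdaptiveNorm(nn.Module):
    """Normalize features, then re-modulate them with per-pixel scale and bias
    predicted jointly from a semantic map and a broadcast global style code."""
    def __init__(self, channels: int, label_channels: int, style_dim: int, hidden: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels + style_dim, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, semantic_map, style_code):
        # Broadcast the global style vector to every spatial location.
        style = style_code[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        cond = self.shared(torch.cat([semantic_map, style], dim=1))
        return self.norm(x) * (1 + self.to_gamma(cond)) + self.to_beta(cond)

if __name__ == "__main__":
    feats = torch.rand(1, 128, 32, 32)   # generator features
    sem = torch.rand(1, 10, 32, 32)      # semantic layout (e.g., one-hot maps)
    style = torch.rand(1, 16)            # global style code
    out = SpatiallyAdaptiveNorm(128, 10, 16)(feats, sem, style)
    print(out.shape)                     # torch.Size([1, 128, 32, 32])
```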

Citations: 0
Temporally consistent video colorization with deep feature propagation and self-regularization learning
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-023-0342-8
Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong

Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively less research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method can not only provide visually pleasing colorized video, but also with clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, while code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.
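The self-regularization idea, penalizing disagreement between predictions obtained with different time steps, can be written as a simple consistency loss. The L1 formulation and tensor shapes below are illustrative assumptions rather than the exact TCVC objective.

```python
import torch

def self_regularization_loss(pred_a: torch.Tensor, pred_b: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two colorizations of the same frames
    produced under different temporal configurations (e.g., different
    propagation step sizes). Both tensors have shape (T, C, H, W)."""
    return torch.mean(torch.abs(pred_a - pred_b))

# Toy usage: two slightly different predictions for an 8-frame clip.
a = torch.rand(8, 2, 64, 64)              # e.g., predicted chrominance channels
b = a + 0.01 * torch.randn_like(a)
print(self_regularization_loss(a, b))     # small scalar consistency loss
```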

Citations: 0
Multi-granularity sequence generation for hierarchical image classification
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-022-0332-2
Xinda Liu, Lili Wang

Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embedding as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.
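The step of traversing the taxonomy and flattening each image's labels into a coarse-to-fine target sequence can be illustrated as follows; the tiny taxonomy, special tokens, and vocabulary construction are hypothetical examples, not the paper's dataset.

```python
# A tiny, hypothetical three-level taxonomy: order -> family -> species.
TAXONOMY = {
    "bird": {"duck": ["mallard", "teal"], "owl": ["barn_owl"]},
    "dog":  {"retriever": ["golden", "labrador"]},
}

def label_sequence(order: str, family: str, species: str) -> list[str]:
    """Arrange one image's multi-granularity labels coarse-to-fine, so a
    sequence decoder can predict them left to right."""
    assert family in TAXONOMY[order] and species in TAXONOMY[order][family]
    return ["<bos>", order, family, species, "<eos>"]

def build_vocab(taxonomy: dict) -> dict[str, int]:
    """Map every label at every granularity (plus special tokens) to an id."""
    tokens = ["<bos>", "<eos>"]
    for order, families in taxonomy.items():
        tokens.append(order)
        for family, species_list in families.items():
            tokens.append(family)
            tokens.extend(species_list)
    return {tok: i for i, tok in enumerate(tokens)}

vocab = build_vocab(TAXONOMY)
seq = label_sequence("bird", "duck", "mallard")
print(seq, [vocab[t] for t in seq])
```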

Citations: 0
Generating diverse clothed 3D human animations via a generative model
IF 6.9 | CAS Tier 3, Computer Science | Q1 Computer Science | Pub Date: 2024-01-03 | DOI: 10.1007/s41095-022-0324-2
Min Shi, Wenke Feng, Lin Gao, Dengming Zhu

Data-driven garment animation is a current topic of interest in the computer graphics industry. Existing approaches generally establish the mapping between a single human pose or a temporal pose sequence, and garment deformation, but it is difficult to quickly generate diverse clothed human animations. We address this problem with a method to automatically synthesize dressed human animations with temporal consistency from a specified human motion label. At the heart of our method is a two-stage strategy. Specifically, we first learn a latent space encoding the sequence-level distribution of human motions utilizing a transformer-based conditional variational autoencoder (Transformer-CVAE). Then a garment simulator synthesizes dynamic garment shapes using a transformer encoder–decoder architecture. Since the learned latent space comes from varied human motions, our method can generate a variety of styles of motions given a specific motion label. By means of a novel beginning of sequence (BOS) learning strategy and a self-supervised refinement procedure, our garment simulator is capable of efficiently synthesizing garment deformation sequences corresponding to the generated human motions while maintaining temporal and spatial consistency. We verify our ideas experimentally. This is the first generative model that directly dresses human animation.
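At generation time, a conditional VAE samples a latent code from the prior, conditions it on the motion-label embedding, and decodes a motion. The MLP decoder and all dimensions below are a deliberately simplified stand-in for the Transformer-CVAE and garment simulator described above.

```python
import torch
import torch.nn as nn

class ToyConditionalDecoder(nn.Module):
    """Simplified CVAE decoder: latent code + motion-label embedding -> a
    short pose sequence (a flat vector reshaped to frames x joints x 3)."""
    def __init__(self, latent_dim=32, num_labels=10, label_dim=16,
                 frames=30, joints=24):
        super().__init__()
        self.frames, self.joints = frames, joints
        self.label_embed = nn.Embedding(num_labels, label_dim)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + label_dim, 256), nn.ReLU(),
            nn.Linear(256, frames * joints * 3))

    def forward(self, z, label):
        cond = torch.cat([z, self.label_embed(label)], dim=-1)
        return self.mlp(cond).view(-1, self.frames, self.joints, 3)

if __name__ == "__main__":
    decoder = ToyConditionalDecoder()
    z = torch.randn(4, 32)                 # samples from the prior N(0, I)
    labels = torch.tensor([2, 2, 5, 7])    # desired motion labels
    motions = decoder(z, labels)
    print(motions.shape)                   # torch.Size([4, 30, 24, 3]): 4 varied motions
```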

Citations: 0