
Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision: Latest Publications

Understanding Collapse in Non-contrastive Siamese Representation Learning
Alexander C. Li, Alexei A. Efros, Deepak Pathak
{"title":"Understanding Collapse in Non-contrastive Siamese Representation Learning","authors":"Alexander C. Li, Alexei A. Efros, Deepak Pathak","doi":"10.1007/978-3-031-19821-2_28","DOIUrl":"https://doi.org/10.1007/978-3-031-19821-2_28","url":null,"abstract":"","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73642709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
DELTAR: Depth Estimation from a Light-weight ToF Sensor and RGB Image
Yijin Li, Xinyang Liu, Wenqian Dong, Han Zhou, H. Bao, Guofeng Zhang, Yinda Zhang, Zhaopeng Cui
Light-weight time-of-flight (ToF) depth sensors are small, cheap, and low-energy, and have been massively deployed on mobile devices for purposes such as autofocus and obstacle detection. However, due to their specific measurements (a depth distribution over a region instead of the depth value at a certain pixel) and extremely low resolution, they are insufficient for applications requiring high-fidelity depth such as 3D reconstruction. In this paper, we propose DELTAR, a novel method to empower light-weight ToF sensors with the capability of measuring high-resolution and accurate depth by cooperating with a color image. As the core of DELTAR, a feature extractor customized for depth distributions and an attention-based neural architecture are proposed to fuse information from the color and ToF domains efficiently. To evaluate our system in real-world scenarios, we design a data collection device and propose a new approach to calibrate the RGB camera and ToF sensor. Experiments show that our method produces more accurate depth than existing frameworks designed for depth completion and depth super-resolution and achieves on-par performance with a commodity-level RGB-D sensor. Code and data are available at https://zju3dv.github.io/deltar/.
{"title":"DELTAR: Depth Estimation from a Light-weight ToF Sensor and RGB Image","authors":"Yijin Li, Xinyang Liu, Wenqian Dong, Han Zhou, H. Bao, Guofeng Zhang, Yinda Zhang, Zhaopeng Cui","doi":"10.48550/arXiv.2209.13362","DOIUrl":"https://doi.org/10.48550/arXiv.2209.13362","url":null,"abstract":"Light-weight time-of-flight (ToF) depth sensors are small, cheap, low-energy and have been massively deployed on mobile devices for the purposes like autofocus, obstacle detection, etc. However, due to their specific measurements (depth distribution in a region instead of the depth value at a certain pixel) and extremely low resolution, they are insufficient for applications requiring high-fidelity depth such as 3D reconstruction. In this paper, we propose DELTAR, a novel method to empower light-weight ToF sensors with the capability of measuring high resolution and accurate depth by cooperating with a color image. As the core of DELTAR, a feature extractor customized for depth distribution and an attention-based neural architecture is proposed to fuse the information from the color and ToF domain efficiently. To evaluate our system in real-world scenarios, we design a data collection device and propose a new approach to calibrate the RGB camera and ToF sensor. Experiments show that our method produces more accurate depth than existing frameworks designed for depth completion and depth super-resolution and achieves on par performance with a commodity-level RGB-D sensor. Code and data are available at https://zju3dv.github.io/deltar/.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83874951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
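A minimal PyTorch sketch of the kind of attention-based ToF/RGB fusion the DELTAR abstract above describes: each zone's depth histogram is embedded as a token and RGB feature tokens attend to those tokens. Module names, shapes, and the single-block design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToFRGBCrossAttention(nn.Module):
    """Illustrative cross-attention block: RGB feature tokens attend to
    per-zone ToF depth-distribution tokens (not the authors' code)."""

    def __init__(self, dim=64, num_heads=4, num_bins=16):
        super().__init__()
        self.tof_embed = nn.Linear(num_bins, dim)   # embed each zone's depth histogram
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, tof_hist):
        # rgb_tokens: (B, H*W, dim) image features; tof_hist: (B, Z, num_bins)
        tof_tokens = self.tof_embed(tof_hist)                 # (B, Z, dim)
        fused, _ = self.attn(rgb_tokens, tof_tokens, tof_tokens)
        return self.norm(rgb_tokens + fused)                  # residual fusion

if __name__ == "__main__":
    block = ToFRGBCrossAttention()
    rgb = torch.randn(2, 30 * 40, 64)     # coarse RGB feature map, flattened
    tof = torch.rand(2, 8 * 8, 16)        # 8x8 ToF zones, 16-bin depth histograms
    print(block(rgb, tof).shape)          # torch.Size([2, 1200, 64])
```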
AcroFOD: An Adaptive Method for Cross-domain Few-shot Object Detection
Yipeng Gao, Lingxiao Yang, Yunmu Huang, Song Xie, Shiyong Li, Weihao Zheng
Under domain shift, cross-domain few-shot object detection aims to adapt object detectors to the target domain with only a few annotated target samples. There exist two significant challenges: (1) highly insufficient target-domain data; (2) potential over-adaptation and misleading training caused by inappropriately amplified target samples without any restriction. To address these challenges, we propose an adaptive method consisting of two parts. First, we propose an adaptive optimization strategy to select augmented data similar to target samples rather than blindly increasing the amount; specifically, we filter out augmented candidates that significantly deviate from the target feature distribution at the very beginning. Second, to further relieve the data limitation, we propose multi-level domain-aware data augmentation to increase the diversity and rationality of the augmented data, which exploits a cross-image foreground-background mixture. Experiments show that the proposed method achieves state-of-the-art performance on multiple benchmarks.
{"title":"AcroFOD: An Adaptive Method for Cross-domain Few-shot Object Detection","authors":"Yipeng Gao, Lingxiao Yang, Yunmu Huang, Song Xie, Shiyong Li, Weihao Zheng","doi":"10.48550/arXiv.2209.10904","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10904","url":null,"abstract":"Under the domain shift, cross-domain few-shot object detection aims to adapt object detectors in the target domain with a few annotated target data. There exists two significant challenges: (1) Highly insufficient target domain data; (2) Potential over-adaptation and misleading caused by inappropriately amplified target samples without any restriction. To address these challenges, we propose an adaptive method consisting of two parts. First, we propose an adaptive optimization strategy to select augmented data similar to target samples rather than blindly increasing the amount. Specifically, we filter the augmented candidates which significantly deviate from the target feature distribution in the very beginning. Second, to further relieve the data limitation, we propose the multi-level domain-aware data augmentation to increase the diversity and rationality of augmented data, which exploits the cross-image foreground-background mixture. Experiments show that the proposed method achieves state-of-the-art performance on multiple benchmarks.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77736103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
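A hedged sketch of the filtering step described in the AcroFOD abstract above (discarding augmented candidates that deviate from the target feature distribution). The distance measure, keep ratio, and function name are assumptions for illustration only.

```python
import torch

def filter_augmented_by_target_distance(aug_feats, target_feats, keep_ratio=0.5):
    """Keep the augmented samples whose features lie closest to the mean of the
    few-shot target-domain features (illustrative stand-in for AcroFOD's
    distribution-based filtering; the Euclidean distance here is an assumption)."""
    target_mean = target_feats.mean(dim=0, keepdim=True)          # (1, D)
    dists = torch.cdist(aug_feats, target_mean).squeeze(1)        # (N,)
    k = max(1, int(keep_ratio * aug_feats.size(0)))
    keep_idx = torch.topk(-dists, k).indices                      # smallest distances
    return keep_idx

if __name__ == "__main__":
    aug = torch.randn(100, 256)     # features of augmented source samples
    tgt = torch.randn(10, 256)      # features of the few annotated target samples
    idx = filter_augmented_by_target_distance(aug, tgt, keep_ratio=0.3)
    print(idx.shape)                # torch.Size([30])
```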
FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image Fusion
Fabian Duffhauss, Ngo Anh Vien, Hanna Ziesche, G. Neumann
Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches either are not data-driven, and thus cannot exploit prior knowledge or find regularities in a given dataset, or are restricted to a single application. We overcome this shortcoming by presenting a novel deep hierarchical variational autoencoder called FusionVAE that can serve as a basis for many fusion tasks. Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound for the conditional log-likelihood of FusionVAE. In order to assess the fusion capabilities of our model thoroughly, we created three novel datasets for image fusion based on popular computer vision datasets. In our experiments, we show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach outperforms traditional methods significantly. Furthermore, we present the advantages and disadvantages of different design choices.
{"title":"FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image Fusion","authors":"Fabian Duffhauss, Ngo Anh Vien, Hanna Ziesche, G. Neumann","doi":"10.48550/arXiv.2209.11277","DOIUrl":"https://doi.org/10.48550/arXiv.2209.11277","url":null,"abstract":". Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches are either not data-driven and cannot exploit prior knowledge nor find regu-larities in a given dataset or they are restricted to a single application. We overcome this shortcoming by presenting a novel deep hierarchical variational autoencoder called FusionVAE that can serve as a basis for many fusion tasks. Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound for the conditional log-likelihood of FusionVAE. In order to assess the fusion capabilities of our model thoroughly, we created three novel datasets for image fusion based on popular computer vision datasets. In our experiments, we show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach outperforms traditional methods significantly. Furthermore, we present the advantages and disadvantages of different design choices.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86771546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
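For readers unfamiliar with the conditional lower bound mentioned in the FusionVAE abstract above, here is a minimal single-level sketch of a conditional ELBO with the reparameterization trick. The paper's model is hierarchical; the loss form, weights, and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Draw z ~ q(z | inputs) differentiably from its Gaussian parameters."""
    std = (0.5 * logvar).exp()
    return mu + std * torch.randn_like(std)

def conditional_elbo(recon, target, mu, logvar, beta=1.0):
    """Negative single-level conditional ELBO: reconstruction term plus a
    KL divergence to a standard-normal prior (illustrative, not the paper's
    hierarchical objective)."""
    rec = F.mse_loss(recon, target, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

if __name__ == "__main__":
    mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)
    z = reparameterize(mu, logvar)                   # latent code for a decoder
    recon, target = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
    print(conditional_elbo(recon, target, mu, logvar).item())
```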
IntereStyle: Encoding an Interest Region for Robust StyleGAN Inversion
S. Moon, GyeongMoon Park
Recently, manipulation of real-world images has been highly elaborated along with the development of Generative Adversarial Networks (GANs) and corresponding encoders, which embed real-world images into the latent space. However, designing GAN encoders remains a challenging task due to the trade-off between distortion and perception. In this paper, we point out that existing encoders try to lower distortion not only on the interest region, e.g., the human facial region, but also on the uninterest region, e.g., background patterns and obstacles. However, most uninterest regions in real-world images are out-of-distribution (OOD) and thus infeasible for generative models to reconstruct ideally. Moreover, we empirically find that an uninterest region overlapping the interest region can mangle the original features of the interest region; e.g., a microphone overlapping a facial region is inverted into a white beard. As a result, lowering the distortion of the whole image while maintaining perceptual quality is very challenging. To overcome this trade-off, we propose a simple yet effective encoder training scheme, coined IntereStyle, which facilitates encoding by focusing on the interest region. IntereStyle steers the encoder to disentangle the encodings of the interest and uninterest regions. To this end, we filter the information of the uninterest region iteratively to regulate its negative impact. We demonstrate that IntereStyle achieves both lower distortion and higher perceptual quality compared to existing state-of-the-art encoders. In particular, our model robustly conserves features of the original images, as shown by robust image editing and style-mixing results. We will release our code with the pre-trained model after the review.
{"title":"IntereStyle: Encoding an Interest Region for Robust StyleGAN Inversion","authors":"S. Moon, GyeongMoon Park","doi":"10.48550/arXiv.2209.10811","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10811","url":null,"abstract":"Recently, manipulation of real-world images has been highly elaborated along with the development of Generative Adversarial Networks (GANs) and corresponding encoders, which embed real-world images into the latent space. However, designing encoders of GAN still remains a challenging task due to the trade-off between distortion and perception. In this paper, we point out that the existing encoders try to lower the distortion not only on the interest region, e.g., human facial region but also on the uninterest region, e.g., background patterns and obstacles. However, most uninterest regions in real-world images are located at out-of-distribution (OOD), which are infeasible to be ideally reconstructed by generative models. Moreover, we empirically find that the uninterest region overlapped with the interest region can mangle the original feature of the interest region, e.g., a microphone overlapped with a facial region is inverted into the white beard. As a result, lowering the distortion of the whole image while maintaining the perceptual quality is very challenging. To overcome this trade-off, we propose a simple yet effective encoder training scheme, coined IntereStyle, which facilitates encoding by focusing on the interest region. IntereStyle steers the encoder to disentangle the encodings of the interest and uninterest regions. To this end, we filter the information of the uninterest region iteratively to regulate the negative impact of the uninterest region. We demonstrate that IntereStyle achieves both lower distortion and higher perceptual quality compared to the existing state-of-the-art encoders. Especially, our model robustly conserves features of the original images, which shows the robust image editing and style mixing results. We will release our code with the pre-trained model after the review.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83881888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
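A simplified stand-in for the region-focused training the IntereStyle abstract above describes: the reconstruction (distortion) loss is weighted by an interest-region mask so the uninterest region contributes less. The weighting scheme and values are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def interest_weighted_distortion(recon, real, interest_mask, w_in=1.0, w_out=0.1):
    """Per-pixel L1 distortion, emphasized inside the interest region (e.g. the
    face) and down-weighted outside it. Illustrative only; the weights are
    assumptions."""
    per_pixel = F.l1_loss(recon, real, reduction="none")          # (B, C, H, W)
    weights = w_in * interest_mask + w_out * (1.0 - interest_mask)  # (B, 1, H, W)
    return (per_pixel * weights).mean()

if __name__ == "__main__":
    recon, real = torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256)
    mask = (torch.rand(2, 1, 256, 256) > 0.5).float()   # 1 inside the interest region
    print(interest_weighted_distortion(recon, real, mask).item())
```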
PREF: Predictability Regularized Neural Motion Fields
Liangchen Song, Xuan Gong, Benjamin Planche, Meng Zheng, D. Doermann, Junsong Yuan, Terrence Chen, Ziyan Wu
Knowing the 3D motions in a dynamic scene is essential to many vision applications. Recent progress is mainly focused on estimating the activity of some specific elements like humans. In this paper, we leverage a neural motion field for estimating the motion of all points in a multiview setting. Modeling the motion from a dynamic scene with multiview data is challenging due to the ambiguities in points of similar color and points with time-varying color. We propose to regularize the estimated motion to be predictable. If the motion from previous frames is known, then the motion in the near future should be predictable. Therefore, we introduce a predictability regularization by first conditioning the estimated motion on latent embeddings, then by adopting a predictor network to enforce predictability on the embeddings. The proposed framework PREF (Predictability REgularized Fields) achieves on par or better results than state-of-the-art neural motion field-based dynamic scene representation methods, while requiring no prior knowledge of the scene.
{"title":"PREF: Predictability Regularized Neural Motion Fields","authors":"Liangchen Song, Xuan Gong, Benjamin Planche, Meng Zheng, D. Doermann, Junsong Yuan, Terrence Chen, Ziyan Wu","doi":"10.48550/arXiv.2209.10691","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10691","url":null,"abstract":"Knowing the 3D motions in a dynamic scene is essential to many vision applications. Recent progress is mainly focused on estimating the activity of some specific elements like humans. In this paper, we leverage a neural motion field for estimating the motion of all points in a multiview setting. Modeling the motion from a dynamic scene with multiview data is challenging due to the ambiguities in points of similar color and points with time-varying color. We propose to regularize the estimated motion to be predictable. If the motion from previous frames is known, then the motion in the near future should be predictable. Therefore, we introduce a predictability regularization by first conditioning the estimated motion on latent embeddings, then by adopting a predictor network to enforce predictability on the embeddings. The proposed framework PREF (Predictability REgularized Fields) achieves on par or better results than state-of-the-art neural motion field-based dynamic scene representation methods, while requiring no prior knowledge of the scene.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74675657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
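A minimal sketch of a predictability regularizer in the spirit of the PREF abstract above: a small predictor maps the latent embeddings of the previous frames to the current frame's embedding, and the squared prediction error serves as the penalty. The predictor architecture and history length are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class PredictabilityRegularizer(nn.Module):
    """Penalize motion embeddings that cannot be predicted from recent history
    (illustrative version of a predictability regularization term)."""

    def __init__(self, dim=64, history=3):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(history * dim, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, past_embeds, current_embed):
        # past_embeds: (B, history, dim); current_embed: (B, dim)
        pred = self.predictor(past_embeds.flatten(1))
        return ((pred - current_embed) ** 2).mean()

if __name__ == "__main__":
    reg = PredictabilityRegularizer()
    loss = reg(torch.randn(8, 3, 64), torch.randn(8, 64))
    print(loss.item())
```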
NashAE: Disentangling Representations through Adversarial Covariance Minimization
Eric C. Yeats, Frank Liu, David A. P. Womble, Hai Li
We present a self-supervised method to disentangle factors of variation in high-dimensional data that does not rely on prior knowledge of the underlying variation profile (e.g., no assumptions on the number or distribution of the individual latent variables to be extracted). In this method, which we call NashAE, high-dimensional feature disentanglement is accomplished in the low-dimensional latent space of a standard autoencoder (AE) by promoting the discrepancy between each encoding element and the information about that element recovered from all other encoding elements. Disentanglement is promoted efficiently by framing this as a minmax game between the AE and an ensemble of regression networks, each of which provides an estimate of one element conditioned on an observation of all other elements. We quantitatively compare our approach with leading disentanglement methods using existing disentanglement metrics. Furthermore, we show that NashAE has increased reliability and an increased capacity to capture salient data characteristics in the learned latent representation.
{"title":"NashAE: Disentangling Representations through Adversarial Covariance Minimization","authors":"Eric C. Yeats, Frank Liu, David A. P. Womble, Hai Li","doi":"10.48550/arXiv.2209.10677","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10677","url":null,"abstract":"We present a self-supervised method to disentangle factors of variation in high-dimensional data that does not rely on prior knowledge of the underlying variation profile (e.g., no assumptions on the number or distribution of the individual latent variables to be extracted). In this method which we call NashAE, high-dimensional feature disentanglement is accomplished in the low-dimensional latent space of a standard autoencoder (AE) by promoting the discrepancy between each encoding element and information of the element recovered from all other encoding elements. Disentanglement is promoted efficiently by framing this as a minmax game between the AE and an ensemble of regression networks which each provide an estimate of an element conditioned on an observation of all other elements. We quantitatively compare our approach with leading disentanglement methods using existing disentanglement metrics. Furthermore, we show that NashAE has increased reliability and increased capacity to capture salient data characteristics in the learned latent representation.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85665731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
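A minimal sketch of the adversarial setup in the NashAE abstract above: one small regressor per latent dimension tries to predict that element from all the others, and the autoencoder is trained to make this prediction hard. The network sizes are assumptions, and the full method also includes the reconstruction objective and the covariance details from the paper.

```python
import torch
import torch.nn as nn

class LatentPredictors(nn.Module):
    """One regressor per latent dimension; regressor i predicts z_i from the
    remaining elements z_{-i} (sketch of the ensemble, not the authors' code)."""

    def __init__(self, latent_dim=8, hidden=32):
        super().__init__()
        self.nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(latent_dim - 1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
             for _ in range(latent_dim)]
        )

    def forward(self, z):
        preds = []
        for i, net in enumerate(self.nets):
            others = torch.cat([z[:, :i], z[:, i + 1:]], dim=1)   # drop element i
            preds.append(net(others))
        return torch.cat(preds, dim=1)                            # (B, latent_dim)

if __name__ == "__main__":
    z = torch.randn(16, 8)                      # latent codes from the autoencoder
    predictors = LatentPredictors()
    pred_err = ((predictors(z) - z) ** 2).mean()
    # The predictors minimize pred_err; the autoencoder is trained to increase it
    # (alongside its reconstruction loss), which discourages redundant latents.
    print(pred_err.item())
```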
KXNet: A Model-Driven Deep Neural Network for Blind Super-Resolution
J. Fu, Hong Wang, Qi Xie, Qian Zhao, Deyu Meng, Zongben Xu
Although current deep learning-based methods have achieved promising performance in the blind single image super-resolution (SISR) task, most of them mainly focus on heuristically constructing diverse network architectures and put less emphasis on explicitly embedding the physical generation mechanism between blur kernels and high-resolution (HR) images. To alleviate this issue, we propose a model-driven deep neural network, called KXNet, for blind SISR. Specifically, to solve the classical SISR model, we propose a simple yet effective iterative algorithm. Then, by unfolding the involved iterative steps into corresponding network modules, we naturally construct the KXNet. The main specificity of the proposed KXNet is that the entire learning process is fully and explicitly integrated with the inherent physical mechanism underlying the SISR task. Thus, the learned blur kernel has clear physical patterns, and the mutually iterative process between the blur kernel and the HR image can soundly guide KXNet to evolve in the right direction. Extensive experiments on synthetic and real data clearly demonstrate the superior accuracy and generality of our method beyond current representative state-of-the-art blind SISR methods. Code is available at: https://github.com/jiahong-fu/KXNet.
{"title":"KXNet: A Model-Driven Deep Neural Network for Blind Super-Resolution","authors":"J. Fu, Hong Wang, Qi Xie, Qian Zhao, Deyu Meng, Zongben Xu","doi":"10.48550/arXiv.2209.10305","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10305","url":null,"abstract":"Although current deep learning-based methods have gained promising performance in the blind single image super-resolution (SISR) task, most of them mainly focus on heuristically constructing diverse network architectures and put less emphasis on the explicit embedding of the physical generation mechanism between blur kernels and high-resolution (HR) images. To alleviate this issue, we propose a model-driven deep neural network, called KXNet, for blind SISR. Specifically, to solve the classical SISR model, we propose a simple-yet-effective iterative algorithm. Then by unfolding the involved iterative steps into the corresponding network module, we naturally construct the KXNet. The main specificity of the proposed KXNet is that the entire learning process is fully and explicitly integrated with the inherent physical mechanism underlying this SISR task. Thus, the learned blur kernel has clear physical patterns and the mutually iterative process between blur kernel and HR image can soundly guide the KXNet to be evolved in the right direction. Extensive experiments on synthetic and real data finely demonstrate the superior accuracy and generality of our method beyond the current representative state-of-the-art blind SISR methods. Code is available at: https://github.com/jiahong-fu/KXNet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86085056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
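A generic deep-unfolding skeleton illustrating the idea, named in the KXNet abstract above, of turning an iterative blur-kernel/HR-image algorithm into stacked network stages. The update modules here are placeholders under stated assumptions; the paper derives its own update rules from the SISR model.

```python
import torch
import torch.nn as nn

class UnfoldedSR(nn.Module):
    """Each stage alternately refines the blur-kernel estimate and the HR image
    estimate with small learnable modules (illustrative unfolding skeleton)."""

    def __init__(self, stages=3, ksize=11, channels=3):
        super().__init__()
        self.k_update = nn.ModuleList(
            [nn.Sequential(nn.Linear(ksize * ksize, ksize * ksize), nn.Softmax(dim=-1))
             for _ in range(stages)]
        )
        self.x_update = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(stages)]
        )

    def forward(self, lr_up, k_init):
        # lr_up: upsampled LR image (B, C, H, W); k_init: flattened kernel (B, ksize*ksize)
        x, k = lr_up, k_init
        for k_net, x_net in zip(self.k_update, self.x_update):
            k = k_net(k)                 # refine kernel estimate (kept normalized)
            x = x + x_net(x)             # refine HR estimate (residual update)
        return x, k

if __name__ == "__main__":
    net = UnfoldedSR()
    hr, kernel = net(torch.rand(1, 3, 64, 64), torch.rand(1, 121))
    print(hr.shape, kernel.shape)
```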
HVC-Net: Unifying Homography, Visibility, and Confidence Learning for Planar Object Tracking
Haoxian Zhang, Yonggen Ling
Robust and accurate planar tracking over a whole video sequence is vitally important for many vision applications. The key to planar object tracking is to find object correspondences, modeled by a homography, between the reference image and the tracked image. Existing methods tend to obtain wrong correspondences under appearance variations, camera-object relative motion, and occlusions. To alleviate this problem, we present a unified convolutional neural network (CNN) model that jointly considers homography, visibility, and confidence. First, we introduce correlation blocks that explicitly account for local appearance changes and camera-object relative motion as the base of our model. Second, we jointly learn the homography and the visibility that links camera-object relative motion with occlusions. Third, we propose a confidence module that actively monitors the estimation quality from the pixel correlation distributions obtained in the correlation blocks. All these modules are plugged into a Lucas-Kanade (LK) tracking pipeline to obtain both accurate and robust planar object tracking. Our approach outperforms state-of-the-art methods on the public POT and TMT datasets. Its superior performance is also verified on a real-world application, synthesizing high-quality in-video advertisements.
{"title":"HVC-Net: Unifying Homography, Visibility, and Confidence Learning for Planar Object Tracking","authors":"Haoxian Zhang, Yonggen Ling","doi":"10.48550/arXiv.2209.08924","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08924","url":null,"abstract":"Robust and accurate planar tracking over a whole video sequence is vitally important for many vision applications. The key to planar object tracking is to find object correspondences, modeled by homography, between the reference image and the tracked image. Existing methods tend to obtain wrong correspondences with changing appearance variations, camera-object relative motions and occlusions. To alleviate this problem, we present a unified convolutional neural network (CNN) model that jointly considers homography, visibility, and confidence. First, we introduce correlation blocks that explicitly account for the local appearance changes and camera-object relative motions as the base of our model. Second, we jointly learn the homography and visibility that links camera-object relative motions with occlusions. Third, we propose a confidence module that actively monitors the estimation quality from the pixel correlation distributions obtained in correlation blocks. All these modules are plugged into a Lucas-Kanade (LK) tracking pipeline to obtain both accurate and robust planar object tracking. Our approach outperforms the state-of-the-art methods on public POT and TMT datasets. Its superior performance is also verified on a real-world application, synthesizing high-quality in-video advertisements.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77366297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
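A generic local correlation volume, the kind of building block the HVC-Net abstract above refers to as correlation blocks: for every position in the reference features, dot products are taken with a small neighborhood of the tracked features. The window size and normalization are assumptions; the paper's exact block design is not reproduced here.

```python
import torch
import torch.nn.functional as F

def local_correlation(ref_feat, trk_feat, radius=3):
    """Correlation between reference and tracked feature maps over a
    (2*radius+1)^2 search window around each position (illustrative)."""
    b, c, h, w = ref_feat.shape
    win = 2 * radius + 1
    trk_pad = F.pad(trk_feat, (radius, radius, radius, radius))
    # Gather all shifted neighborhoods of the tracked features.
    patches = F.unfold(trk_pad, kernel_size=win)              # (B, C*win*win, H*W)
    patches = patches.view(b, c, win * win, h * w)
    ref = ref_feat.view(b, c, 1, h * w)
    corr = (ref * patches).sum(dim=1) / c ** 0.5              # (B, win*win, H*W)
    return corr.view(b, win * win, h, w)

if __name__ == "__main__":
    a, b_feat = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(local_correlation(a, b_feat).shape)   # torch.Size([1, 49, 32, 32])
```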
D&D: Learning Human Dynamics from Dynamic Camera 龙与地下城:从动态摄像机学习人类动态
Jiefeng Li, Siyuan Bian, Chaoshun Xu, Gang Liu, Gang Yu, Cewu Lu
3D human pose estimation from a monocular video has recently seen significant improvements. However, most state-of-the-art methods are kinematics-based and prone to physically implausible motions with pronounced artifacts. Current dynamics-based methods can predict physically plausible motion but are restricted to simple scenarios with a static camera view. In this work, we present D&D (Learning Human Dynamics from Dynamic Camera), which leverages the laws of physics to reconstruct 3D human motion from in-the-wild videos with a moving camera. D&D introduces inertial force control (IFC) to explain 3D human motion in the non-inertial local frame by considering the inertial forces of the dynamic camera. To learn ground contact with limited annotations, we develop probabilistic contact torque (PCT), which is computed by differentiable sampling from contact probabilities and used to generate motions. The contact state can be weakly supervised by encouraging the model to generate correct motions. Furthermore, we propose an attentive PD controller that adjusts target pose states using temporal information to obtain smooth and accurate pose control. Our approach is entirely neural-based and runs without offline optimization or simulation in physics engines. Experiments on large-scale 3D human motion benchmarks demonstrate the effectiveness of D&D, where we exhibit superior performance against both state-of-the-art kinematics-based and dynamics-based methods. Code is available at https://github.com/Jeffsjtu/DnD
{"title":"D&D: Learning Human Dynamics from Dynamic Camera","authors":"Jiefeng Li, Siyuan Bian, Chaoshun Xu, Gang Liu, Gang Yu, Cewu Lu","doi":"10.48550/arXiv.2209.08790","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08790","url":null,"abstract":"3D human pose estimation from a monocular video has recently seen significant improvements. However, most state-of-the-art methods are kinematics-based, which are prone to physically implausible motions with pronounced artifacts. Current dynamics-based methods can predict physically plausible motion but are restricted to simple scenarios with static camera view. In this work, we present D&D (Learning Human Dynamics from Dynamic Camera), which leverages the laws of physics to reconstruct 3D human motion from the in-the-wild videos with a moving camera. D&D introduces inertial force control (IFC) to explain the 3D human motion in the non-inertial local frame by considering the inertial forces of the dynamic camera. To learn the ground contact with limited annotations, we develop probabilistic contact torque (PCT), which is computed by differentiable sampling from contact probabilities and used to generate motions. The contact state can be weakly supervised by encouraging the model to generate correct motions. Furthermore, we propose an attentive PD controller that adjusts target pose states using temporal information to obtain smooth and accurate pose control. Our approach is entirely neural-based and runs without offline optimization or simulation in physics engines. Experiments on large-scale 3D human motion benchmarks demonstrate the effectiveness of D&D, where we exhibit superior performance against both state-of-the-art kinematics-based and dynamics-based methods. Code is available at https://github.com/Jeffsjtu/DnD","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88062279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
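A hedged sketch of the differentiable-sampling idea behind the probabilistic contact torque (PCT) described in the D&D abstract above: a relaxed Bernoulli sample of the contact probability gates a candidate torque, so gradients flow back to the contact logits. The specific relaxation, shapes, and function name are assumptions, not the authors' formulation.

```python
import torch

def probabilistic_contact_torque(contact_logits, torque, temperature=0.5):
    """Gate candidate contact torques by a differentiable (relaxed Bernoulli)
    sample of the per-joint contact probability (illustrative only)."""
    relaxed = torch.distributions.RelaxedBernoulli(
        temperature=torch.tensor(temperature), logits=contact_logits
    )
    contact = relaxed.rsample()                 # (B, J) soft contact states in (0, 1)
    return contact.unsqueeze(-1) * torque       # (B, J, 3) gated contact torques

if __name__ == "__main__":
    logits = torch.randn(2, 4, requires_grad=True)   # 4 potential contact joints
    torque = torch.randn(2, 4, 3)                    # candidate torque per joint
    out = probabilistic_contact_torque(logits, torque)
    out.sum().backward()                             # gradients reach the logits
    print(logits.grad.shape)                         # torch.Size([2, 4])
```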