
2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Latest Publications

Learning Graph Variational Autoencoders with Constraints and Structured Priors for Conditional Indoor 3D Scene Generation
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00085
Aditya Chattopadhyay, Xi Zhang, D. Wipf, H. Arora, Rene Vidal
We present a graph variational autoencoder with a structured prior for generating the layout of indoor 3D scenes. Given the room type (e.g., living room or library) and the room layout (e.g., room elements such as floor and walls), our architecture generates a collection of objects (e.g., furniture items such as sofa, table and chairs) that is consistent with the room type and layout. This is a challenging problem because the generated scene needs to satisfy multiple constraints, e.g., each object should lie inside the room and two objects should not occupy the same volume. To address these challenges, we propose a deep generative model that encodes these relationships as soft constraints on an attributed graph (e.g., the nodes capture attributes of room and furniture elements, such as shape, class, pose and size, and the edges capture geometric relationships such as relative orientation). The architecture consists of a graph encoder that maps the input graph to a structured latent space, and a graph decoder that generates a furniture graph, given a latent code and the room graph. The latent space is modeled with autoregressive priors, which facilitates the generation of highly structured scenes. We also propose an efficient training procedure that combines matching and constrained learning. Experiments on the 3D-FRONT dataset show that our method produces scenes that are diverse and are adapted to the room layout.
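A minimal PyTorch sketch of the encode-then-decode structure described above: a message-passing encoder pools an attributed graph into a latent code, and a decoder conditioned on that code and room features emits furniture attributes. The mean-aggregation layer, the layer sizes, and the plain Gaussian prior (in place of the paper's autoregressive prior) are illustrative assumptions, not the authors' architecture.
```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """One round of mean-aggregation message passing over an adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node attributes, adj: (N, N) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin(adj @ x / deg))

class ConditionalGraphVAE(nn.Module):
    def __init__(self, node_dim, hidden_dim=64, latent_dim=32):
        super().__init__()
        self.enc = GraphLayer(node_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # The decoder conditions on the latent code plus room context features.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + node_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, node_dim))

    def forward(self, x, adj, room_feat):
        h = self.enc(x, adj).mean(dim=0)              # graph-level embedding
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        out = self.dec(torch.cat([z, room_feat], dim=-1))   # attributes of a generated node
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return out, kl

# Toy usage: 5 furniture nodes with 16-dim attributes in a fully connected graph.
x = torch.randn(5, 16)
adj = torch.ones(5, 5)
recon, kl = ConditionalGraphVAE(node_dim=16)(x, adj, room_feat=torch.randn(16))
```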
Citations: 3
CNN2Graph: Building Graphs for Image Classification
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00009
Vivek Trivedy, Longin Jan Latecki
Neural Network classifiers generally operate under the i.i.d. assumption, where examples are passed through independently during training. We propose CNN2GNN and CNN2Transformer, which instead leverage inter-example information for classification. We use Graph Neural Networks (GNNs) to build a latent-space bipartite graph and compute cross-attention scores between input images and a proxy set. Our approach addresses several challenges of existing methods. Firstly, it is end-to-end differentiable despite the generally discrete nature of graph construction. Secondly, it allows inductive inference at no extra cost. Thirdly, it presents a simple method to construct graphs from arbitrary datasets that captures both example-level and class-level information. Finally, it addresses the proxy collapse problem by combining contrastive and cross-entropy losses rather than separate clustering algorithms. Our results improve classification performance over baseline experiments and outperform other methods. We also conduct an empirical investigation showing that Transformer-style attention scales better than GAT attention with dataset size.
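A minimal sketch of the kind of cross-attention scoring between image embeddings and a learnable proxy set that the abstract describes. The `ProxyCrossAttention` module, the single-head formulation, and the dimensions are illustrative assumptions rather than the paper's implementation.
```python
import torch
import torch.nn as nn

class ProxyCrossAttention(nn.Module):
    def __init__(self, feat_dim=128, num_proxies=10):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_proxies, feat_dim))  # learnable proxy set
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)

    def forward(self, img_feats):
        # img_feats: (B, feat_dim) embeddings from a CNN backbone.
        q = self.q(img_feats)                       # queries come from the images
        k = self.k(self.proxies)                    # keys come from the proxies
        scores = q @ k.t() / q.shape[-1] ** 0.5     # (B, num_proxies) similarities
        return scores.softmax(dim=-1)               # attention of each image over proxies

feats = torch.randn(4, 128)                         # a batch of 4 image embeddings
attn = ProxyCrossAttention()(feats)                 # (4, 10) proxy attention map
```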
Citations: 0
GEMS: Generating Efficient Meta-Subnets
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00528
Varad Pimpalkhute, Shruti Kunde, Rekha Singhal
Gradient-based meta learners (GBML) such as MAML [6] aim to learn a model initialization across similar tasks, such that the model generalizes well on unseen tasks sampled from the same distribution with few gradient updates. A limitation of GBML is its inability to adapt to real-world applications where input tasks are sampled from multiple distributions. An existing effort [23] learns $\mathcal{N}$ initializations for tasks sampled from $\mathcal{N}$ distributions, roughly increasing training time by a factor of $\mathcal{N}$. Instead, we use a single model initialization to learn distribution-specific parameters for every input task. This reduces negative knowledge transfer across distributions and overall computational cost. Specifically, we explore two ways of efficiently learning on multi-distribution tasks: 1) Binary Mask Perceptron (BMP), which learns distribution-specific layers, and 2) Multi-modal Supermask (MMSUP), which learns distribution-specific parameters. We evaluate the performance of the proposed framework (GEMS) on few-shot vision classification tasks. The experimental results demonstrate an improvement in accuracy and a speed-up of ~2× to 4× in training time over existing state-of-the-art algorithms on quasi-benchmark datasets in the field of meta-learning.
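A minimal sketch of a per-distribution binary mask over a shared weight matrix, in the spirit of the supermask idea named above; the straight-through binarization and the shapes are assumptions for illustration, not the GEMS modules themselves.
```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_distributions):
        super().__init__()
        self.base = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)   # shared weights
        # One real-valued score tensor per input distribution.
        self.scores = nn.Parameter(torch.randn(num_distributions, out_dim, in_dim))

    def forward(self, x, dist_id):
        s = self.scores[dist_id]
        hard = (s > 0).float()                               # binarized mask for this distribution
        mask = hard + s.sigmoid() - s.sigmoid().detach()     # straight-through gradient estimator
        return x @ (self.base * mask).t()

layer = MaskedLinear(32, 16, num_distributions=3)
y = layer(torch.randn(8, 32), dist_id=1)                     # (8, 16) output for distribution 1
```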
Citations: 0
Efficient Flow-Guided Multi-frame De-fencing
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00188
Stavros Tsogkas, Feng Zhang, A. Jepson, Alex Levinshtein
Taking photographs "in-the-wild" is often hindered by fence obstructions that stand between the camera user and the scene of interest, and which are hard or impossible to avoid. De-fencing is the algorithmic process of automatically removing such obstructions from images, revealing the invisible parts of the scene. While this problem can be formulated as a combination of fence segmentation and image inpainting, this often leads to implausible hallucinations of the occluded regions. Existing multi-frame approaches rely on propagating information to a selected keyframe from its temporal neighbors, but they are often inefficient and struggle with alignment of severely obstructed images. In this work we draw inspiration from the video completion literature, and develop a simplified framework for multi-frame de-fencing that computes high-quality flow maps directly from obstructed frames and uses them to accurately align frames. Our primary focus is efficiency and practicality in a real-world setting: the input to our algorithm is a short image burst (5 frames), a data modality commonly available in modern smartphones, and the output is a single reconstructed keyframe with the fence removed. Our approach leverages simple yet effective CNN modules, trained on carefully generated synthetic data, and outperforms more complicated alternatives on real bursts, both quantitatively and qualitatively, while running in real time.
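A minimal sketch of the basic operation behind flow-guided alignment: backward-warping a neighboring frame toward the keyframe with a per-pixel flow map. The `warp` helper and the placeholder zero flow are illustrative; the paper's flow network and fusion stages are not reproduced here.
```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) toward the keyframe using `flow` (B, 2, H, W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(b, -1, -1, -1)
    coords = grid + flow                                     # where each output pixel samples from
    # Normalize coordinates to [-1, 1] in (x, y) order for grid_sample.
    norm_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    norm_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(frame, torch.stack((norm_x, norm_y), dim=-1), align_corners=True)

neighbor = torch.randn(1, 3, 64, 64)      # a neighboring frame of the burst
flow = torch.zeros(1, 2, 64, 64)          # zero flow leaves the frame unchanged
aligned = warp(neighbor, flow)            # (1, 3, 64, 64) frame aligned to the keyframe
```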
Citations: 0
Proactive Deepfake Defence via Identity Watermarking
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00458
Yuan Zhao, Bo Liu, Ming Ding, Baoping Liu, Tianqing Zhu, Xin Yu
The explosive progress of Deepfake techniques poses unprecedented privacy and security risks to our society by creating real-looking but fake visual content. Current Deepfake detection studies are still in their infancy because they mainly rely on capturing artifacts left by a Deepfake synthesis process as detection clues, which can be easily removed by various distortions (e.g., blurring) or advanced Deepfake techniques. In this paper, we propose a novel method that does not depend on identifying the artifacts but instead resorts to the mechanism of anti-counterfeit labels to protect face images from malicious Deepfake tampering. Specifically, we design a neural network with an encoder-decoder structure to embed watermarks as anti-Deepfake labels into the facial identity features. The injected label is entangled with the facial identity feature, so it is sensitive to face swap translations (i.e., Deepfake) and robust to conventional image modifications (e.g., resizing and compression). Therefore, we can identify whether watermarked images have been tampered with by Deepfake methods according to the label's existence. Experimental results demonstrate that our method achieves an average detection accuracy of more than 80%, which validates the proposed method's effectiveness in implementing Deepfake detection.
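A minimal sketch of embedding watermark bits into a face-identity feature with an encoder and recovering them with a decoder, mirroring the encoder-decoder idea above at a toy scale. The plain MLPs, feature size, and bit length are assumptions, not the paper's network.
```python
import torch
import torch.nn as nn

class WatermarkEmbedder(nn.Module):
    def __init__(self, feat_dim=512, wm_bits=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + wm_bits, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        self.decoder = nn.Linear(feat_dim, wm_bits)

    def embed(self, identity_feat, watermark):
        # Entangle the watermark bits with the identity feature.
        return self.encoder(torch.cat([identity_feat, watermark], dim=-1))

    def extract(self, feat):
        # Predict the watermark bits; their recoverability flags an untampered image.
        return torch.sigmoid(self.decoder(feat))

model = WatermarkEmbedder()
feat = torch.randn(1, 512)                      # identity feature from a face network
bits = torch.randint(0, 2, (1, 32)).float()     # the anti-Deepfake label
recovered = model.extract(model.embed(feat, bits))   # (1, 32) predicted bit probabilities
```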
Citations: 5
MFCFlow: A Motion Feature Compensated Multi-Frame Recurrent Network for Optical Flow Estimation
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00504
Yonghu Chen, Dongchen Zhu, Wenjun Shi, Guanghui Zhang, Tianyu Zhang, Xiaolin Zhang, Jiamao Li
Occlusions have long been a hard nut to crack in optical flow estimation due to ambiguous pixel matching between adjacent images. Current methods take only two consecutive images as input, which makes it challenging to capture temporal coherence and reason about occluded regions. In this paper, we propose a novel optical flow estimation framework, namely MFCFlow, which attempts to compensate for the information lost to occlusions by mining and transferring motion features between multiple frames. Specifically, we construct a Motion-guided Feature Compensation cell (MFC cell) to enhance ambiguous motion features according to the correlation of previous features obtained by an attention-based structure. Furthermore, a TopK attention strategy is developed and embedded into the MFC cell to improve the subsequent matching quality. Extensive experiments demonstrate that our MFCFlow achieves significant improvements in occluded regions and attains state-of-the-art performance on both the Sintel and KITTI benchmarks among multi-frame optical flow methods.
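A minimal sketch of a top-k sparsified attention step, the generic form of the TopK strategy mentioned above; the single head, the value of k, and the feature sizes are illustrative assumptions rather than MFCFlow's actual module.
```python
import torch

def topk_attention(q, k, v, top_k=4):
    # q, k, v: (N, D) single-head features.
    scores = q @ k.t() / q.shape[-1] ** 0.5            # (N, N) similarity matrix
    vals, idx = scores.topk(top_k, dim=-1)             # keep only the k best matches per query
    sparse = torch.full_like(scores, float("-inf"))
    sparse.scatter_(-1, idx, vals)                     # everything else is masked out
    return sparse.softmax(dim=-1) @ v                  # (N, D) aggregated values

q = k = v = torch.randn(16, 32)
out = topk_attention(q, k, v)                          # (16, 32)
```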
Citations: 2
CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00294
Cheng-Yen Yang, Jiajia Luo, Lu Xia, Yuyin Sun, Nan Qiao, Ke Zhang, Zhongyu Jiang, Jenq-Neng Hwang
To improve the generalization of 3D human pose estimators, many existing deep learning based models focus on adding different augmentations to training poses. However, data augmentation techniques are limited to the "seen" pose combinations and struggle to infer poses with rare "unseen" joint positions. To address this problem, we present CameraPose, a weakly-supervised framework for 3D human pose estimation from a single image, which can be applied not only to 2D-3D pose pairs but also to 2D-only annotations. By adding a camera parameter branch, any in-the-wild 2D annotations can be fed into our pipeline to boost the training diversity, and the 3D poses can be implicitly learned by reprojecting back to 2D. Moreover, CameraPose introduces a refinement network module with confidence-guided loss to further improve the quality of noisy 2D keypoints extracted by 2D pose estimators. Experimental results demonstrate that CameraPose brings clear improvements on cross-scenario datasets. Notably, it outperforms the baseline method by 3mm on the most challenging dataset, 3DPW. In addition, by combining our proposed refinement network module with existing 3D pose estimators, their performance can be improved in cross-scenario evaluation.
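A minimal sketch of the reprojection idea behind training on 2D-only annotations: project predicted camera-space 3D joints to 2D with pinhole intrinsics and penalize the gap to the 2D labels. The intrinsics, joint count, and loss form are placeholders, not CameraPose's camera branch.
```python
import torch

def reprojection_loss(joints_3d, joints_2d, focal, center):
    # joints_3d: (J, 3) camera-space joints, joints_2d: (J, 2) image-space annotations.
    x, y, z = joints_3d.unbind(dim=-1)
    proj = torch.stack((focal * x / z + center[0],
                        focal * y / z + center[1]), dim=-1)   # pinhole projection to pixels
    return ((proj - joints_2d) ** 2).mean()

pred_3d = torch.randn(17, 3) + torch.tensor([0.0, 0.0, 5.0])  # keep depth positive
annot_2d = torch.randn(17, 2) * 100 + 500                     # in-the-wild 2D keypoints
loss = reprojection_loss(pred_3d, annot_2d, focal=1000.0, center=(512.0, 512.0))
```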
Citations: 3
Avoiding Lingering in Learning Active Recognition by Adversarial Disturbance
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00459
Lei Fan, Ying Wu
This paper considers the active recognition scenario, where the agent is empowered to intelligently acquire observations for better recognition. Such agents usually comprise two modules, i.e., the policy and the recognizer, to select actions and predict the category. While ground-truth class labels supervise the recognizer, the policy is typically updated with rewards determined by the current in-training recognizer, such as whether it achieves correct predictions. However, this joint learning process could lead to unintended solutions, like a collapsed policy that only visits views on which the recognizer is already sufficiently trained to obtain rewards, which harms the generalization ability. We call this phenomenon lingering, depicting the agent's reluctance to explore challenging views during training. Existing approaches to tackle the exploration-exploitation trade-off can be ineffective, as they usually assume reliable feedback during exploration to update the estimate of rarely-visited states. This assumption is invalid here, as the recognizer that provides the reward may itself be insufficiently trained. To this end, our approach integrates another adversarial policy to constantly disturb the recognition agent during training, forming a competing game that promotes active exploration and avoids lingering. The reinforced adversary, rewarded when recognition fails, contests the recognition agent by turning the camera toward challenging observations. Extensive experiments across two datasets validate the effectiveness of the proposed approach regarding its recognition performance, learning efficiency, and especially robustness in managing environmental noise.
Citations: 3
ReEnFP: Detail-Preserving Face Reconstruction by Encoding Facial Priors
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00606
Yasheng Sun, Jiangke Lin, Hang Zhou, Zhi-liang Xu, Dongliang He, H. Koike
We address the problem of face modeling, where efficiently achieving high-quality reconstruction results remains challenging. Neither previous regression-based nor optimization-based frameworks strike a good balance between facial reconstruction fidelity and efficiency. We notice that the large amount of in-the-wild facial images contains diverse appearance information; however, this underlying knowledge is not fully exploited for face modeling. To this end, we propose our Reconstruction by Encoding Facial Priors (ReEnFP) pipeline to exploit the potential of unconstrained facial images for further improvement. Our key idea is to encode generative priors learned by a style-based texture generator on unconstrained data for fast and detail-preserving face reconstruction. With our texture generator pre-trained using a differentiable renderer, faces can be encoded into its latent space, as opposed to time-consuming optimization-based inversion. Our generative prior encoding is further enhanced with a pyramid fusion block for adaptive integration of input spatial information. Extensive experiments show that our method reconstructs photo-realistic facial textures and geometric details with precise identity recovery.
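A minimal sketch of pyramid-style fusion: pool a feature map to several scales, process each scale, and merge them back at full resolution. The channel sizes, scales, and convolutions are illustrative assumptions, not ReEnFP's pyramid fusion block.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    def __init__(self, channels=64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in scales])
        self.merge = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for s, conv in zip(self.scales, self.convs):
            y = F.avg_pool2d(x, kernel_size=s) if s > 1 else x   # downsample to this scale
            y = conv(y)
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return self.merge(torch.cat(feats, dim=1))               # fuse all scales

fused = PyramidFusion()(torch.randn(1, 64, 32, 32))              # (1, 64, 32, 32)
```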
Citations: 0
Split to Learn: Gradient Split for Multi-Task Human Image Analysis
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00433
Weijian Deng, Yumin Suh, Xiang Yu, M. Faraki, Liang Zheng, Manmohan Chandraker
This paper presents an approach to train a unified deep network that simultaneously solves multiple human-related tasks. A multi-task framework is favorable for sharing information across tasks under restricted computational resources. However, tasks not only share information but may also compete for resources and conflict with each other, making the optimization of shared parameters difficult and leading to suboptimal performance. We propose a simple but effective training scheme called GradSplit that alleviates this issue by utilizing asymmetric inter-task relations. Specifically, at each convolution module, it splits features into T groups for T tasks and trains each group using only the gradients back-propagated from the task losses with which it does not conflict. During training, we apply GradSplit to a series of convolution modules. As a result, each module is trained to generate a set of task-specific features using the shared features from the previous module. This enables a network to use complementary information across tasks while circumventing gradient conflicts. Experimental results show that GradSplit achieves a better accuracy-efficiency trade-off than existing methods. It minimizes the accuracy drop caused by task conflicts while significantly saving compute resources in terms of both FLOPs and memory at inference. We further show that GradSplit achieves higher cross-dataset accuracy compared to single-task and other multi-task networks.
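A minimal sketch of the gradient-split idea: the shared feature is chunked into per-task groups, and each task's loss back-propagates only into its own group because the other groups are detached. The two-task setup, tiny heads, and placeholder losses are assumptions for illustration.
```python
import torch
import torch.nn as nn

backbone = nn.Linear(32, 16)                  # shared features, split into 2 groups of 8
heads = nn.ModuleList([nn.Linear(16, 1) for _ in range(2)])

x = torch.randn(4, 32)
groups = backbone(x).chunk(2, dim=-1)         # one feature group per task

losses = []
for t, head in enumerate(heads):
    # Task t sees its own group with gradients; the other group is detached.
    mixed = torch.cat([g if i == t else g.detach() for i, g in enumerate(groups)], dim=-1)
    losses.append(head(mixed).pow(2).mean())  # placeholder per-task losses

sum(losses).backward()                        # each group receives gradients from one task only
```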
Citations: 1