
Latest publications: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Connecting the Complementary-view Videos: Joint Camera Identification and Subject Association
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00245
Ruize Han, Yiyang Gan, Jiacheng Li, F. Wang, Wei Feng, Song Wang
We attempt to connect the data from complementary views, i.e., the top view from drone-mounted cameras in the air and the side view from wearable cameras on the ground. Collaborative analysis of such complementary-view data can facilitate building air-ground cooperative visual systems for a wide range of applications. This is a very challenging problem due to the large view difference between the top and side views. In this paper, we develop a new approach that can simultaneously handle three tasks: i) localizing the side-view camera in the top view; ii) estimating the view direction of the side-view camera; iii) detecting and associating the same subjects on the ground across the complementary views. Our main idea is to explore the spatial position layout of the subjects in the two views. In particular, we propose a spatial-aware position representation method to embed the spatial-position distribution of the subjects in different views. We further design a cross-view video collaboration framework composed of a camera identification module and a subject association module to perform the above three tasks simultaneously. We collect a new synthetic dataset consisting of top-view and side-view video sequence pairs for performance evaluation, and the experimental results show the effectiveness of the proposed method.
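To make the cross-view association step more concrete, here is a minimal PyTorch sketch of the general idea: embed each subject's position relative to the layout of its own view and score cross-view pairs by embedding similarity. The module name, network sizes, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialLayoutEncoder(nn.Module):
    """Embed each subject's position relative to the layout of its own view (assumed design)."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, positions):                      # positions: (N, 2) subject centers in one view
        centered = positions - positions.mean(dim=0, keepdim=True)   # translation-invariant layout
        return F.normalize(self.mlp(centered), dim=-1)               # (N, dim) unit embeddings

def associate(top_positions, side_positions, encoder):
    """Return an (N_top, N_side) similarity matrix for cross-view subject matching."""
    return encoder(top_positions) @ encoder(side_positions).t()

encoder = SpatialLayoutEncoder()
similarity = associate(torch.rand(5, 2), torch.rand(4, 2), encoder)
print(similarity.shape)                                # torch.Size([5, 4])
```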
Citations: 4
Show, Deconfound and Tell: Image Captioning with Causal Inference
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01751
Bing Liu, Dong Wang, Xu Yang, Yong Zhou, Rui Yao, Zhiwen Shao, Jiaqi Zhao
The transformer-based encoder-decoder framework has shown remarkable performance in image captioning. However, most transformer-based captioning methods overlook two kinds of elusive confounders: the visual confounder and the linguistic confounder, which generally lead to harmful bias, induce spurious correlations during training, and degrade model generalization. In this paper, we first use Structural Causal Models (SCMs) to show how the two confounders damage image captioning. Then we apply the backdoor adjustment to propose a novel causal inference based image captioning (CIIC) framework, which consists of an interventional object detector (IOD) and an interventional transformer decoder (ITD) to jointly confront both confounders. In the encoding stage, the IOD is able to disentangle the region-based visual features by deconfounding the visual confounder. In the decoding stage, the ITD introduces causal intervention into the transformer decoder and deconfounds the visual and linguistic confounders simultaneously. The two modules collaborate with each other to alleviate the spurious correlations caused by the unobserved confounders. When tested on MSCOCO, our proposal significantly outperforms state-of-the-art encoder-decoder models on the Karpathy split and the online test split. Code is published at https://github.com/CUMTGG/CIIC.
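As a rough illustration of how a backdoor adjustment can be approximated in feature space, the sketch below mixes region features with an attention-weighted expectation over a fixed confounder dictionary. The dictionary size, the fusion layer, and all names are assumptions; the actual IOD/ITD modules in the paper are more involved.
```python
import torch
import torch.nn as nn

class BackdoorAdjust(nn.Module):
    """Approximate P(Y|do(X)) by mixing features with an expectation over a confounder dictionary."""
    def __init__(self, feat_dim=512, num_confounders=100):
        super().__init__()
        # Dictionary of confounder prototypes (e.g., clustered object/word features) - assumed.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))
        self.query = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x):                                   # x: (batch, regions, feat_dim)
        attn = torch.softmax(self.query(x) @ self.confounders.t(), dim=-1)  # (B, R, K)
        expectation = attn @ self.confounders                               # E_z[z | x]
        return self.fuse(torch.cat([x, expectation], dim=-1))               # deconfounded features

feats = torch.randn(2, 36, 512)
print(BackdoorAdjust()(feats).shape)                        # torch.Size([2, 36, 512])
```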
Citations: 17
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00862
Zikang Zhou, Luyao Ye, Jianping Wang, Kui Wu, K. Lu
Accurately predicting the future motions of surrounding traffic agents is critical for the safety of autonomous vehicles. Recently, vectorized approaches have dominated the motion prediction community due to their capability of capturing complex interactions in traffic scenes. However, existing methods neglect the symmetries of the problem and suffer from high computational cost, facing the challenge of making real-time multi-agent motion prediction without sacrificing prediction performance. To tackle this challenge, we propose the Hierarchical Vector Transformer (HiVT) for fast and accurate multi-agent motion prediction. By decomposing the problem into local context extraction and global interaction modeling, our method can effectively and efficiently model a large number of agents in the scene. Meanwhile, we propose a translation-invariant scene representation and rotation-invariant spatial learning modules, which extract features robust to the geometric transformations of the scene and enable the model to make accurate predictions for multiple agents in a single forward pass. Experiments show that HiVT achieves state-of-the-art performance on the Argoverse motion forecasting benchmark with a small model size and can make fast multi-agent motion prediction.
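The invariance idea can be illustrated in a few lines: express each agent's neighbourhood in a frame translated to the agent and rotated by its heading, so the downstream encoder never sees absolute coordinates. This is a hedged sketch of the general recipe, not the HiVT code; the rotation convention and tensor shapes are assumptions.
```python
import torch

def to_local_frame(agent_pos, agent_heading, neighbor_pos):
    """agent_pos: (2,), agent_heading: scalar radians, neighbor_pos: (N, 2) in world coordinates."""
    cos, sin = torch.cos(agent_heading), torch.sin(agent_heading)
    rot = torch.stack([torch.stack([cos, -sin]), torch.stack([sin, cos])])  # world -> heading frame
    rel = neighbor_pos - agent_pos          # translation invariance
    return rel @ rot                        # rotation invariance (row-vector convention)

local = to_local_frame(torch.tensor([1.0, 2.0]), torch.tensor(0.5), torch.rand(8, 2))
print(local.shape)                          # torch.Size([8, 2])
```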
Citations: 71
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01327
Wu Zheng, Ming-Hong Hong, Li Jiang, Chi-Wing Fu
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector. The approach needs LiDAR-image data only when training the single-modality detector, and once well-trained, it only needs LiDAR data at inference. We design a novel framework to realize the approach: response distillation to focus on the crucial response samples and avoid most background samples; sparse-voxel distillation to learn voxel semantics and relations from the estimated crucial voxels; a fine-grained voxel-to-point distillation to better attend to features of small and distant objects; and instance distillation to further enhance the deep-feature consistency. Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors and even surpasses the baseline LiDAR-image detector on the key NDS metric, filling ~72% of the mAP gap between the single- and multi-modality detectors.
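A hedged sketch of the response-distillation component: distill the student's detection heatmap toward the teacher's only at "crucial" locations where either network is confident, so background positions do not dominate the loss. The threshold, the binary-cross-entropy form, and the tensor layout are assumptions rather than the paper's exact design.
```python
import torch
import torch.nn.functional as F

def response_distill(student_logits, teacher_logits, thresh=0.3):
    """Both tensors: (B, num_classes, H, W) detection heatmap logits."""
    with torch.no_grad():
        t_prob = teacher_logits.sigmoid()
        s_prob = student_logits.sigmoid()
        # Crucial locations: either network fires above the (assumed) confidence threshold.
        crucial = ((t_prob.max(dim=1).values > thresh) |
                   (s_prob.max(dim=1).values > thresh))                  # (B, H, W) mask
    loss = F.binary_cross_entropy_with_logits(
        student_logits, teacher_logits.sigmoid().detach(), reduction="none").mean(dim=1)
    return (loss * crucial).sum() / crucial.sum().clamp(min=1)

s, t = torch.randn(2, 10, 128, 128), torch.randn(2, 10, 128, 128)
print(response_distill(s, t).item())
```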
Citations: 14
Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01274
Lei Jin, Chenyang Xu, Xiaojuan Wang, Yabo Xiao, Yandong Guo, Xuecheng Nie, Jian Zhao
The existing multi-person absolute 3D pose estimation methods are mainly based on a two-stage paradigm, i.e., top-down or bottom-up, leading to redundant pipelines with high computation cost. We argue that it is more desirable to simplify this two-stage paradigm into a single-stage one to promote both efficiency and performance. To this end, we present an efficient single-stage solution, the Decoupled Regression Model (DRM), with three distinct novelties. First, DRM introduces a new decoupled representation for 3D pose, which expresses the 2D pose in the image plane and the depth information of each 3D human instance via a 2D center point (the center of visible keypoints) and a root point (denoted as the pelvis), respectively. Second, to learn a better feature representation for human depth regression, DRM introduces a 2D Pose-guided Depth Query Module (PDQM) to extract features in the 2D pose regression branch, enabling the depth regression branch to perceive the scale information of instances. Third, DRM leverages a Decoupled Absolute Pose Loss (DAPL) to facilitate absolute root depth and root-relative depth estimation, thus improving the accuracy of the absolute 3D pose. Comprehensive experiments on challenging benchmarks including MuPoTS-3D and Panoptic clearly verify the superiority of our framework, which outperforms the state-of-the-art bottom-up absolute 3D pose estimation methods.
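To make the decoupled representation concrete, the sketch below back-projects a predicted 2D center point and a predicted absolute root depth through the camera intrinsics to recover the absolute 3D root. The intrinsics and shapes are illustrative; the DRM heads that produce these predictions are not shown.
```python
import torch

def backproject_root(center_2d, root_depth, K):
    """center_2d: (N, 2) pixels, root_depth: (N,) metres, K: (3, 3) camera intrinsics."""
    ones = torch.ones(center_2d.shape[0], 1)
    homo = torch.cat([center_2d, ones], dim=1)            # (N, 3) homogeneous pixel coordinates
    rays = homo @ torch.inverse(K).t()                    # camera-frame rays with z = 1
    return rays * root_depth.unsqueeze(1)                 # (N, 3) absolute root positions

K = torch.tensor([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
roots = backproject_root(torch.tensor([[700.0, 400.0]]), torch.tensor([3.5]), K)
print(roots)
```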
Citations: 18
Neural Mesh Simplification
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01803
Rolandos Alexandros Potamias, Stylianos Ploumpis, S. Zafeiriou
Despite advances in rendering, editing, and preprocessing methods for 3D meshes, their real-time execution remains infeasible for large-scale meshes. To ease and accelerate such processes, mesh simplification methods have been introduced with the aim of reducing the mesh resolution while preserving its appearance. In this work we attempt to tackle the novel task of learnable and differentiable mesh simplification. Compared to traditional simplification approaches that collapse edges in a greedy iterative manner, we propose a fast and scalable method that simplifies a given mesh in one pass. The proposed method unfolds in three steps. Initially, a subset of the input vertices is sampled using a sophisticated extension of random sampling. Then, we train a sparse attention network to propose candidate triangles based on the edge connectivity of the sampled vertices. Finally, a classification network estimates the probability that a candidate triangle will be included in the final mesh. The fast, lightweight, and differentiable properties of the proposed method make it possible to plug it into any learnable pipeline without introducing a significant overhead. We evaluate both the sampled vertices and the generated triangles under several appearance error measures and compare the method's performance against several state-of-the-art baselines. Furthermore, we showcase that the running performance can be up to 10× faster than traditional methods.
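As an illustration of the third step only, the sketch below scores candidate triangles from the concatenated features of their three vertices and outputs a keep probability per triangle. Layer sizes, the feature construction, and the module name are assumptions, not the released implementation.
```python
import torch
import torch.nn as nn

class TriangleClassifier(nn.Module):
    """Score candidate triangles from their vertex features (assumed minimal design)."""
    def __init__(self, vert_dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(3 * vert_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, vert_feats, tri_idx):
        """vert_feats: (V, vert_dim); tri_idx: (T, 3) vertex indices of candidate triangles."""
        tri_feats = vert_feats[tri_idx].reshape(tri_idx.shape[0], -1)   # (T, 3*vert_dim)
        return torch.sigmoid(self.score(tri_feats)).squeeze(-1)         # keep probability per triangle

feats, tris = torch.randn(100, 64), torch.randint(0, 100, (50, 3))
print(TriangleClassifier()(feats, tris).shape)                          # torch.Size([50])
```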
Citations: 11
Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01758
Fuxiang Wu, Liu Liu, Fusheng Hao, Fengxiang He, Jun Cheng
Object-guided text-to-image synthesis aims to generate images from natural language descriptions with two-step frameworks, i.e., the model generates the layout and then synthesizes images from the layout and captions. However, such frameworks have two issues: 1) complex structure, since generating language-related layouts is not a trivial task; 2) error propagation, because an inappropriate layout will mislead the image synthesis and is hard to revise. In this paper, we propose an object-guided joint-decoding module to simultaneously generate the image and the corresponding layout. Specifically, we present the joint-decoding transformer to model the joint probability of image tokens and the corresponding layout tokens, where the layout tokens provide additional observed data to better model the complex scene. Then, we describe a novel Layout-Vqgan for layout encoding and decoding to provide more information about the complex scene. After that, we present a detail-enhanced module to enrich the language-related details, based on two facts: 1) visual details could be omitted in the compression of VQGANs; 2) the joint-decoding transformer would not have sufficient generating capacity. The experiments show that our approach is competitive with previous object-centered models and can generate diverse and high-quality objects under the given layouts.
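A minimal sketch of joint decoding over layout and image tokens: place both token streams in one sequence with a shared (offset) vocabulary and model it with a causally masked transformer that predicts the next token. Vocabulary sizes, layer counts, and the shared-embedding choice are assumptions for illustration.
```python
import torch
import torch.nn as nn

class JointDecoder(nn.Module):
    """Autoregressive model over a concatenated layout-token + image-token sequence (assumed)."""
    def __init__(self, layout_vocab=512, image_vocab=1024, dim=256, n_layers=4):
        super().__init__()
        vocab = layout_vocab + image_vocab           # shared table; image tokens are offset
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (B, L) layout tokens then image tokens
        L = tokens.shape[1]
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)                          # next-token logits over layout+image vocab

logits = JointDecoder()(torch.randint(0, 1536, (2, 32)))
print(logits.shape)                                  # torch.Size([2, 32, 1536])
```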
Citations: 8
RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00042
Yu-Zhu Guo, Xin-Yi Tong, Y. Chen, Liwen Zhang, Xiaode Liu, Zhe Ma, Xuhui Huang
The brain-inspired, event-driven Spiking Neural Network (SNN), which aims to mimic the synaptic activity of biological neurons, has received increasing attention. It transmits binary spike signals between network units when the membrane potential exceeds the firing threshold. This biomimetic mechanism makes SNNs energy-efficient thanks to their power sparsity and asynchronous operations on spike events. Unfortunately, with the propagation of binary spikes, the distribution of membrane potential will shift, leading to degeneration, saturation, and gradient mismatch problems, which are disadvantageous to network optimization and convergence. Such undesired shifts prevent the SNN from performing well and going deep. To tackle these problems, we attempt to rectify the membrane potential distribution (MPD) by designing a novel distribution loss, MPD-Loss, which can explicitly penalize the undesired shifts without introducing any additional operations in the inference phase. Moreover, the proposed method can also mitigate the quantization error in SNNs, which is usually ignored in other works. Experimental results demonstrate that the proposed method can directly train a deeper, larger, and better-performing SNN within fewer timesteps.
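A simplified illustration of a distribution-shaping penalty on membrane potentials: penalize pre-spike potentials that drift far from the firing threshold on either side, which discourages saturation and deeply negative shifts. This is my own reduction of the MPD-Loss idea to a hinge-style term; the threshold, margin, and exact form are assumptions, not the paper's formulation.
```python
import torch

def mpd_penalty(membrane_potential, threshold=1.0, margin=1.0):
    """membrane_potential: any-shape tensor of pre-spike potentials (assumed units)."""
    dist = (membrane_potential - threshold).abs()
    return torch.relu(dist - margin).mean()     # zero inside [threshold - margin, threshold + margin]

u = torch.randn(32, 128) * 2.0
print(mpd_penalty(u).item())
```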
Citations: 34
Learning to Learn across Diverse Data Biases in Deep Face Recognition
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00404
Chang Liu, Xiang Yu, Yao-Hung Hubert Tsai, M. Faraki, Ramin Moslemi, Manmohan Chandraker, Y. Fu
Convolutional Neural Networks have achieved remarkable success in face recognition, in part due to the abundant availability of data. However, the data used for training CNNs is often imbalanced. Prior works largely focus on the long-tailed nature of face datasets in data volume per identity, or focus on a single bias variation. In this paper, we show that many bias variations, such as ethnicity, head pose, occlusion, and blur, can jointly and significantly affect accuracy. We propose a sample-level weighting approach termed Multi-variation Cosine Margin (MvCoM) to simultaneously consider the multiple variation factors, which orthogonally enhances the face recognition losses to incorporate the importance of training samples. Further, we leverage a learning-to-learn approach, guided by a held-out meta-learning set, and use additive modeling to predict the MvCoM. Extensive experiments on challenging face recognition benchmarks demonstrate the advantages of our method in jointly handling imbalances due to multiple variations.
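To show how a per-sample margin can enter a cosine-margin loss, the sketch below scales an additive margin by a scalar "variation" score for each sample before the softmax cross-entropy. Treating the margin as a fixed linear function of the variation score is an assumption; the paper instead predicts it with a meta-learned additive model.
```python
import torch
import torch.nn.functional as F

def multivariation_cosine_margin(features, weight, labels, variation, s=64.0, base_m=0.35):
    """features: (B, D), weight: (C, D) class centres, labels: (B,), variation: (B,) in [0, 1]."""
    cosine = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()   # (B, C) cosine logits
    m = base_m * (1.0 + variation)                  # harder samples get a larger margin (assumed rule)
    logits = cosine.clone()
    logits[torch.arange(labels.shape[0]), labels] -= m                       # subtract per-sample margin
    return F.cross_entropy(s * logits, labels)

feat, W = torch.randn(8, 512), torch.randn(100, 512)
labels, var = torch.randint(0, 100, (8,)), torch.rand(8)
print(multivariation_cosine_margin(feat, W, labels, var).item())
```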
Citations: 9
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00444
Jinheng Xie, Xianxu Hou, Kai Ye, Linlin Shen
It is widely known that CAM (Class Activation Map) usually activates only discriminative object regions and falsely includes many object-related background regions. As only a fixed set of image-level object labels is available to the WSSS (weakly supervised semantic segmentation) model, it can be very difficult to suppress those diverse background regions consisting of open-set objects. In this paper, we propose a novel Cross Language Image Matching (CLIMS) framework for WSSS, based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress closely related open background regions. In particular, we design object, background-region, and text-label matching losses to guide the model to excite more reasonable object regions for the CAM of each category. In addition, we design a co-occurring background suppression loss, with a predefined set of class-related background text descriptions, to prevent the model from activating closely related background regions. These designs enable the proposed CLIMS to generate a more complete and compact activation map for the target objects. Extensive experiments on the PASCAL VOC2012 dataset show that our CLIMS significantly outperforms the previous state-of-the-art methods. Code will be available at https://github.com/CVI-SZU/CLIMS.
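A rough sketch of the matching-loss idea: weight the image by its CAM, embed the result with a frozen image encoder, and pull it toward the class text embedding while pushing it away from a set of background text embeddings. The encoder stand-in, the hinge on background similarity, and all shapes are assumptions, not the released CLIMS code.
```python
import torch
import torch.nn.functional as F

def clims_style_loss(cam, image, class_text_emb, bg_text_embs, image_encoder):
    """cam: (1, 1, H, W) in [0, 1]; image: (1, 3, H, W); class_text_emb: (D,); bg_text_embs: (K, D)."""
    fg_emb = F.normalize(image_encoder(cam * image), dim=-1)          # embed the activated region
    object_loss = 1.0 - fg_emb @ class_text_emb                       # match the class prompt
    background_loss = torch.relu(fg_emb @ bg_text_embs.t()).mean()    # suppress background prompts
    return object_loss + background_loss

# Toy usage: a random "encoder" stands in for a frozen image-text model's image tower.
enc = lambda x: torch.randn(512)
loss = clims_style_loss(torch.rand(1, 1, 224, 224), torch.rand(1, 3, 224, 224),
                        F.normalize(torch.randn(512), dim=-1),
                        F.normalize(torch.randn(5, 512), dim=-1), enc)
print(loss.item())
```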
Citations: 47