NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00739
Marvin Eisenberger, David Novotný, Gael Kerchenbaum, Patrick Labatut, N. Neverova, D. Cremers, A. Vedaldi
We present NeuroMorph, a new neural network architecture that takes two 3D shapes as input and produces, in one go, i.e. in a single feed-forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture that combines graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train the model end-to-end in a fully unsupervised manner, without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.
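To make the feature-extraction idea concrete, below is a minimal PyTorch sketch of a single layer that combines a graph convolution over mesh vertices with global feature pooling, plus a note on how a deformation field yields interpolations. The layer sizes, the mean-aggregation message passing, and the `GraphConvWithGlobalPool` name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GraphConvWithGlobalPool(nn.Module):
    """Toy per-vertex feature extractor: a local graph convolution fused with a
    globally pooled context vector, loosely in the spirit of the abstract."""
    def __init__(self, in_dim=3, hid_dim=64):
        super().__init__()
        self.local = nn.Linear(2 * in_dim, hid_dim)   # vertex feature + neighbor mean
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)   # local feature + global context

    def forward(self, x, adj):
        # x: (V, in_dim) vertex positions; adj: (V, V) row-normalized adjacency
        neigh = adj @ x                               # mean of neighboring vertices
        h = torch.relu(self.local(torch.cat([x, neigh], dim=-1)))
        g = h.mean(dim=0, keepdim=True).expand_as(h)  # global feature pooling
        return torch.relu(self.fuse(torch.cat([h, g], dim=-1)))

verts = torch.randn(100, 3)
adj = torch.eye(100)                                  # placeholder adjacency matrix
feats = GraphConvWithGlobalPool()(verts, adj)
# A deformation head (not shown) would map `feats` to per-vertex offsets `delta`;
# intermediate shapes then follow X_t = verts + t * delta for t in [0, 1].
```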
Saliency-Guided Image Translation
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01624
Lai Jiang, Mai Xu, Xiaofei Wang, L. Sigal
In this paper, we propose a novel task, saliency-guided image translation, with the goal of image-to-image translation conditioned on a user-specified saliency map. To address this problem, we develop a novel Generative Adversarial Network (GAN)-based model, called SalG-GAN. Given the original image and a target saliency map, SalG-GAN can generate a translated image that satisfies the target saliency map. In SalG-GAN, a disentangled representation framework is proposed to encourage the model to learn diverse translations for the same target saliency condition. A saliency-based attention module is introduced as a special attention mechanism to support the saliency-guided generator, the saliency cue encoder, and the saliency-guided global and local discriminators. Furthermore, we build a synthetic dataset and a real-world dataset with labeled visual attention for training and evaluating SalG-GAN. The experimental results on both datasets verify the effectiveness of our model for saliency-guided image translation.
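As an illustration of the saliency-based attention idea, the following is a hedged PyTorch sketch in which generator features are gated by a mask predicted from the target saliency map. The `SaliencyAttention` module, its channel sizes, and the residual gating are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SaliencyAttention(nn.Module):
    """Hypothetical saliency-based attention: modulate generator features with a
    spatial gate predicted from the target saliency map."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(1, feat_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feats, saliency):
        # feats: (B, feat_ch, H, W); saliency: (B, 1, H, W) with values in [0, 1]
        a = self.gate(saliency)
        return feats * a + feats          # residual gating, an illustrative choice

feats = torch.randn(2, 64, 32, 32)
sal = torch.rand(2, 1, 32, 32)
out = SaliencyAttention()(feats, sal)     # same shape as `feats`
```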
Intelligent Carpet: Inferring 3D Human Pose from Tactile Signals
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01110
Yiyue Luo, Yunzhu Li, Michael Foshey, Wan Shou, Pratyusha Sharma, Tomás Palacios, A. Torralba, W. Matusik
Daily human activities, e.g., locomotion, exercise, and resting, are heavily guided by the tactile interactions between the human and the ground. In this work, leveraging such tactile interactions, we propose a 3D human pose estimation approach that uses the pressure maps recorded by a tactile carpet as input. We build a low-cost, high-density, large-scale intelligent carpet, which enables seamless, real-time recording of human-floor tactile interactions. We collect a synchronized tactile and visual dataset covering various human activities. Employing a state-of-the-art camera-based pose estimation model as supervision, we design and implement a deep neural network model to infer 3D human poses using only the tactile information. Our pipeline can be further scaled up to multi-person pose estimation. We evaluate our system and demonstrate its potential applications in diverse fields.
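A minimal sketch of the tactile-to-pose mapping described above: a small convolutional encoder over a stack of pressure-map frames, followed by a linear head that regresses 3D joint coordinates. The frame count, input resolution, 21-joint output, and the `TactilePoseNet` name are hypothetical choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TactilePoseNet(nn.Module):
    """Toy pressure-map-to-pose regressor trained against camera-based pseudo labels."""
    def __init__(self, frames=5, joints=21):
        super().__init__()
        self.joints = joints
        self.encoder = nn.Sequential(
            nn.Conv2d(frames, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, joints * 3)

    def forward(self, pressure):
        # pressure: (B, frames, H, W) stacked carpet readings
        z = self.encoder(pressure).flatten(1)
        return self.head(z).view(-1, self.joints, 3)   # 3D joint coordinates

pose = TactilePoseNet()(torch.rand(4, 5, 96, 96))      # shape (4, 21, 3)
```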
Wide-Baseline Relative Camera Pose Estimation with Directional Learning
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00327
Kefan Chen, Noah Snavely, A. Makadia
Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions that cause occlusions and significant changes in perspective, leaving little overlap between images. These models continue to struggle even with the benefit of large supervised training datasets. To address their limitations, we take inspiration from techniques showing that regressing keypoint locations in 2D and 3D can be improved by estimating a discrete distribution over keypoint locations. Analogously, in this paper we explore improving camera pose regression by instead predicting a discrete distribution over camera poses. To realize this idea, we introduce DirectionNet, which estimates discrete distributions over the 5D relative pose space using a novel parameterization that makes the estimation problem tractable. Specifically, DirectionNet factorizes the relative camera pose, specified by a 3D rotation and a translation direction, into a set of 3D direction vectors. Since 3D directions can be identified with points on the sphere, DirectionNet estimates discrete distributions on the sphere as its output. We evaluate our model on challenging synthetic and real pose estimation datasets constructed from Matterport3D and InteriorNet. Promising results show a nearly 50% reduction in error over direct regression methods.
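To illustrate why distributions on the sphere are convenient, the snippet below shows one way to turn a discrete distribution over sampled unit vectors into a single direction estimate (spherical expectation followed by normalization). This post-processing step is a plausible sketch, not necessarily DirectionNet's exact recipe.

```python
import torch

def expected_direction(prob, sphere_points):
    """Collapse a discrete distribution over sampled sphere points into one unit
    direction vector: weighted mean in R^3, then projection back to the sphere."""
    # prob: (N,) nonnegative weights summing to 1; sphere_points: (N, 3) unit vectors
    mean = (prob.unsqueeze(-1) * sphere_points).sum(dim=0)
    return mean / mean.norm().clamp_min(1e-8)

# Example: a distribution concentrated around the +z axis.
pts = torch.randn(1000, 3)
pts = pts / pts.norm(dim=-1, keepdim=True)
logits = 10.0 * pts[:, 2]                  # favor directions near +z
prob = torch.softmax(logits, dim=0)
print(expected_direction(prob, pts))       # close to (0, 0, 1)
```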
Clusformer: A Transformer based Clustering Approach to Unsupervised Large-scale Face and Visual Landmark Recognition
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01070
Xuan-Bac Nguyen, Duc Toan Bui, C. Duong, T. D. Bui, Khoa Luu
Research in automatic unsupervised visual clustering has received considerable attention over the last couple of years. It aims at explaining distributions of unlabeled visual images by clustering them via a parameterized model of appearance. Graph Convolutional Neural Networks (GCNs) have recently been among the most popular clustering methods, but they have some limitations. First, they are quite sensitive to hard or noisy samples. Second, they are hard to study with various deep network models because of their training cost. Finally, it is hard to design an end-to-end training model that couples the deep feature extraction and GCN clustering modules. This work therefore presents Clusformer, a simple but new Transformer-based approach to automatic visual clustering via its unsupervised attention mechanism. The proposed method robustly deals with noisy or hard samples. It is also flexible and effective in collaborating with different deep network models of various sizes in an end-to-end framework. The proposed method is evaluated on two popular large-scale visual databases, i.e. Google Landmark and the MS-Celeb-1M face database, and outperforms prior unsupervised clustering methods. Code will be available at https://github.com/VinAIResearch/Clusformer
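A hedged sketch of the kind of Transformer-based clustering head the abstract suggests: a self-attention encoder over the deep features of a sample's k nearest neighbors that predicts, for each neighbor, whether it belongs to the same cluster. The neighborhood size, widths, and the `NeighborhoodTransformer` name are assumptions.

```python
import torch
import torch.nn as nn

class NeighborhoodTransformer(nn.Module):
    """Score a sample's k nearest neighbors as same-cluster or not with a small
    Transformer encoder; an illustrative stand-in for the abstract's idea."""
    def __init__(self, dim=256, k=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)

    def forward(self, neigh_feats):
        # neigh_feats: (B, k, dim) deep features of each sample's k nearest neighbors
        h = self.encoder(neigh_feats)
        return torch.sigmoid(self.score(h)).squeeze(-1)   # (B, k) same-cluster prob

probs = NeighborhoodTransformer()(torch.randn(8, 16, 256))
```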
Transitional Adaptation of Pretrained Models for Visual Storytelling
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01247
Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, Gunhee Kim
Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from a discord between visual specificity and language fluency, since the two modules are often trained separately from large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pre-trained Model (TAPM) that adapts the multi-modal modules to each other using a simpler alignment task between visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models for sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
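To give a flavor of a text-free alignment objective, the snippet below implements an InfoNCE-style loss that pulls adjacent clips of the same story together and pushes apart clips from other stories in the batch. This is a stand-in under stated assumptions, not the exact TAPM adaptation loss.

```python
import torch
import torch.nn.functional as F

def sequential_alignment_loss(clip_feats, temperature=0.1):
    """Contrastive alignment over visual inputs only: each clip's feature should
    match the next clip of the same story rather than clips from other stories."""
    # clip_feats: (B, T, D) projected visual features of T ordered clips per story
    d = clip_feats.size(-1)
    anchors = F.normalize(clip_feats[:, :-1].reshape(-1, d), dim=-1)
    positives = F.normalize(clip_feats[:, 1:].reshape(-1, d), dim=-1)
    logits = anchors @ positives.t() / temperature
    targets = torch.arange(logits.size(0))      # the matching next clip is the positive
    return F.cross_entropy(logits, targets)

loss = sequential_alignment_loss(torch.randn(4, 5, 128))
```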
Learning to Identify Correct 2D-2D Line Correspondences on Sphere
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01157
Haoang Li, Kai Chen, Ji Zhao, Jiangliu Wang, Pyojin Kim, Zhe Liu, Yunhui Liu
Given a set of putative 2D-2D line correspondences, we aim to identify the correct matches. Existing methods exploit geometric constraints and are thus only applicable to structured scenes with orthogonality, parallelism, and coplanarity. In contrast, we propose the first approach suitable for both structured and unstructured scenes. Instead of geometric constraints, we leverage spatial regularity on the sphere. Specifically, we propose to map line correspondences to vectors tangent to the sphere. We use these vectors to encode both the angular and positional variations of image lines, which is more reliable and concise than directly using the inclinations, midpoints, or endpoints of image lines. Neighboring vectors mapped from correct matches exhibit a spatial regularity we call local trend consistency, regardless of the type of scene. To encode this regularity, we design a neural network and also propose a novel loss function that enforces a smoothness constraint on the vector field. In addition, we establish a large real-world dataset for image line matching. Experiments show that our approach outperforms state-of-the-art methods in terms of accuracy, efficiency, and robustness, and also generalizes well.
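The smoothness idea can be pictured as follows: tangent vectors mapped from correct matches should agree with their spatial neighbors. The k-NN neighbor indices and the plain squared-difference penalty below are illustrative assumptions standing in for the paper's loss function.

```python
import torch

def local_trend_consistency_loss(vectors, neighbor_idx):
    """Penalize disagreement between each tangent vector and its spatial neighbors,
    a toy version of a smoothness constraint on the vector field."""
    # vectors: (N, 3) tangent vectors on the sphere; neighbor_idx: (N, k) neighbor indices
    neigh = vectors[neighbor_idx]                 # (N, k, 3)
    diff = vectors.unsqueeze(1) - neigh           # deviation from each neighbor
    return (diff ** 2).sum(dim=-1).mean()

vecs = torch.randn(100, 3)
idx = torch.randint(0, 100, (100, 8))             # placeholder k-NN graph, k = 8
loss = local_trend_consistency_loss(vecs, idx)
```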
Debiased Subjective Assessment of Real-World Image Enhancement
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00077
Peibei Cao, Zhangyang Wang, Kede Ma
In real-world image enhancement, it is often challenging (if not impossible) to acquire ground-truth data, which prevents the adoption of distance metrics for objective quality assessment. As a result, one often resorts to subjective quality assessment, the most straightforward and reliable means of evaluating image enhancement. Conventional subjective testing requires manually pre-selecting a small set of visual examples, which may suffer from three sources of bias: 1) sampling bias, due to the extremely sparse distribution of the selected samples in the image space; 2) algorithmic bias, due to potential overfitting to the selected samples; and 3) subjective bias, due to potential cherry-picking of test results. This eventually makes the field of real-world image enhancement more of an art than a science. Here we take steps towards debiasing conventional subjective assessment by automatically sampling a set of adaptive and diverse images for subsequent testing. This is achieved by casting sample selection as a joint maximization of the discrepancy between the enhancers and the diversity among the selected input images. Careful visual inspection of the resulting enhanced images provides a debiased ranking of the enhancement algorithms. We demonstrate our subjective assessment method on three popular and practically demanding image enhancement tasks: dehazing, super-resolution, and low-light enhancement.
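One way to picture the sampling objective is a greedy selection that trades off how much the enhancers disagree on an image against how different the image is from those already selected. The greedy rule, the `lam` weighting, and the embedding-distance diversity measure below are assumptions, not the paper's optimization.

```python
import numpy as np

def select_test_images(discrepancy, features, m=10, lam=1.0):
    """Greedily pick m candidates that maximize enhancer disagreement plus
    diversity relative to the already selected set."""
    # discrepancy: (N,) disagreement score per candidate; features: (N, D) image embeddings
    selected = []
    for _ in range(m):
        if selected:
            d2 = ((features[:, None] - features[selected]) ** 2).sum(-1)
            diversity = d2.min(axis=1)            # distance to the nearest selected image
        else:
            diversity = np.zeros(len(features))
        score = discrepancy + lam * diversity
        score[selected] = -np.inf                 # never pick the same image twice
        selected.append(int(score.argmax()))
    return selected

picks = select_test_images(np.random.rand(200), np.random.rand(200, 32))
```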
Information Bottleneck Disentanglement for Identity Swapping
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00341
Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, R. He
Improving the performance of face forgery detectors often requires more identity-swapped images of higher quality. One core objective of identity swapping is to generate identity-discriminative faces that are distinct from the target while identical to the source. To this end, properly disentangling identity and identity-irrelevant information is critical and remains a challenging endeavor. In this work, we propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck trade-off, i.e., finding an optimal compression of the pre-trained latent features. Moreover, a novel identity contrastive loss is proposed for further disentanglement by requiring a proper distance between the generated identity and the target. While most prior works have focused on using various loss functions to implicitly guide the learning of representations, we demonstrate that our model can provide explicit supervision for learning disentangled representations, achieving impressive performance in generating more identity-discriminative swapped faces.
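As a rough illustration of the identity contrastive idea, the snippet below penalizes a swapped face whose identity embedding drifts from the source or stays too close to the target. The cosine-similarity margin form is an assumed stand-in for the paper's loss, not its exact formula.

```python
import torch
import torch.nn.functional as F

def identity_contrastive_loss(id_swap, id_source, id_target, margin=0.5):
    """Pull the swapped face's identity embedding toward the source while keeping
    at least `margin` of cosine distance from the target identity."""
    # id_*: (B, D) identity embeddings from a pre-trained face recognition model
    sim_src = F.cosine_similarity(id_swap, id_source, dim=-1)
    sim_tgt = F.cosine_similarity(id_swap, id_target, dim=-1)
    return ((1.0 - sim_src) + F.relu(sim_tgt - (1.0 - margin))).mean()

loss = identity_contrastive_loss(
    torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
)
```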
Nighttime Visibility Enhancement by Increasing the Dynamic Range and Suppression of Light Effects
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01180
Aashish Sharma, R. Tan
Most existing nighttime visibility enhancement methods focus on low light. Night images, however, suffer not only from low light but also from man-made light effects such as glow, glare, and floodlights. Hence, when existing nighttime visibility enhancement methods are applied to these images, they intensify these effects, degrading the visibility even further. High dynamic range (HDR) imaging methods can address the low-light and over-exposed regions, but they cannot remove the light effects and thus cannot enhance the visibility in the affected regions. In this paper, given a single nighttime image as input, our goal is to enhance its visibility by increasing the dynamic range of the intensity, boosting the intensity of the low-light regions while simultaneously suppressing the light effects (glow, glare). First, we use a network to estimate the camera response function (CRF) from the input image and linearise the image. Second, we decompose the linearised image into low-frequency (LF) and high-frequency (HF) feature maps, which are processed separately by two networks for light-effects suppression and noise removal, respectively. Third, we use a network to increase the dynamic range of the processed LF feature maps, which are then combined with the processed HF feature maps to generate the final output with increased dynamic range and suppressed light effects. Our experiments show the effectiveness of our method in comparison with state-of-the-art nighttime visibility enhancement methods.
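The second step's low/high-frequency split can be pictured with a fixed Gaussian decomposition, as sketched below; the paper processes learned LF/HF feature maps with dedicated networks, so this hand-crafted filter is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def decompose_lf_hf(img, kernel_size=15, sigma=4.0):
    """Split a linearised image into a low-frequency base (Gaussian blur) and a
    high-frequency residual; a toy surrogate for the learned decomposition."""
    # img: (B, C, H, W) linearised image
    xs = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-(xs ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    k2d = g[:, None] * g[None, :]                        # separable Gaussian kernel
    weight = k2d.expand(img.size(1), 1, kernel_size, kernel_size).contiguous()
    lf = F.conv2d(img, weight, padding=kernel_size // 2, groups=img.size(1))
    hf = img - lf                                        # edges, texture, and noise
    return lf, hf

lf, hf = decompose_lf_hf(torch.rand(1, 3, 64, 64))
```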