NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00739
Marvin Eisenberger, David Novotný, Gael Kerchenbaum, Patrick Labatut, N. Neverova, D. Cremers, A. Vedaldi
We present NeuroMorph, a new neural network architecture that takes two 3D shapes as input and produces, in one go, i.e. in a single feed-forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture that combines graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train the model end-to-end in a fully unsupervised manner, without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.
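To make the feature-extraction idea concrete, below is a minimal PyTorch sketch of a single layer that combines a graph convolution over mesh vertices with global feature pooling, plus a note on how a deformation field yields interpolations. The layer sizes, the mean-aggregation message passing, and the `GraphConvWithGlobalPool` name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GraphConvWithGlobalPool(nn.Module):
    """Toy per-vertex feature extractor: a local graph convolution fused with a
    globally pooled context vector, loosely in the spirit of the abstract."""
    def __init__(self, in_dim=3, hid_dim=64):
        super().__init__()
        self.local = nn.Linear(2 * in_dim, hid_dim)   # vertex feature + neighbor mean
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)   # local feature + global context

    def forward(self, x, adj):
        # x: (V, in_dim) vertex positions; adj: (V, V) row-normalized adjacency
        neigh = adj @ x                               # mean of neighboring vertices
        h = torch.relu(self.local(torch.cat([x, neigh], dim=-1)))
        g = h.mean(dim=0, keepdim=True).expand_as(h)  # global feature pooling
        return torch.relu(self.fuse(torch.cat([h, g], dim=-1)))

verts = torch.randn(100, 3)
adj = torch.eye(100)                                  # placeholder adjacency matrix
feats = GraphConvWithGlobalPool()(verts, adj)
# A deformation head (not shown) would map `feats` to per-vertex offsets `delta`;
# intermediate shapes then follow X_t = verts + t * delta for t in [0, 1].
```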
Saliency-Guided Image Translation
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01624
Lai Jiang, Mai Xu, Xiaofei Wang, L. Sigal
In this paper, we propose a novel task, saliency-guided image translation, with the goal of image-to-image translation conditioned on a user-specified saliency map. To address this problem, we develop a novel Generative Adversarial Network (GAN)-based model, called SalG-GAN. Given the original image and a target saliency map, SalG-GAN can generate a translated image that satisfies the target saliency map. In SalG-GAN, a disentangled representation framework is proposed to encourage the model to learn diverse translations for the same target saliency condition. A saliency-based attention module is introduced as a special attention mechanism to support the saliency-guided generator, the saliency cue encoder, and the saliency-guided global and local discriminators. Furthermore, we build a synthetic dataset and a real-world dataset with labeled visual attention for training and evaluating SalG-GAN. The experimental results on both datasets verify the effectiveness of our model for saliency-guided image translation.
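As an illustration of the saliency-based attention idea, the following is a hedged PyTorch sketch in which generator features are gated by a mask predicted from the target saliency map. The `SaliencyAttention` module, its channel sizes, and the residual gating are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SaliencyAttention(nn.Module):
    """Hypothetical saliency-based attention: modulate generator features with a
    spatial gate predicted from the target saliency map."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(1, feat_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feats, saliency):
        # feats: (B, feat_ch, H, W); saliency: (B, 1, H, W) with values in [0, 1]
        a = self.gate(saliency)
        return feats * a + feats          # residual gating, an illustrative choice

feats = torch.randn(2, 64, 32, 32)
sal = torch.rand(2, 1, 32, 32)
out = SaliencyAttention()(feats, sal)     # same shape as `feats`
```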
Intelligent Carpet: Inferring 3D Human Pose from Tactile Signals
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01110
Yiyue Luo, Yunzhu Li, Michael Foshey, Wan Shou, Pratyusha Sharma, Tomás Palacios, A. Torralba, W. Matusik
Daily human activities, e.g., locomotion, exercise, and resting, are heavily guided by the tactile interactions between the human and the ground. In this work, leveraging such tactile interactions, we propose a 3D human pose estimation approach that uses the pressure maps recorded by a tactile carpet as input. We build a low-cost, high-density, large-scale intelligent carpet, which enables seamless, real-time recording of human-floor tactile interactions. We collect a synchronized tactile and visual dataset covering various human activities. Employing a state-of-the-art camera-based pose estimation model as supervision, we design and implement a deep neural network model to infer 3D human poses using only the tactile information. Our pipeline can be further scaled up to multi-person pose estimation. We evaluate our system and demonstrate its potential applications in diverse fields.
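A minimal sketch of the tactile-to-pose mapping described above: a small convolutional encoder over a stack of pressure-map frames, followed by a linear head that regresses 3D joint coordinates. The frame count, input resolution, 21-joint output, and the `TactilePoseNet` name are hypothetical choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TactilePoseNet(nn.Module):
    """Toy pressure-map-to-pose regressor trained against camera-based pseudo labels."""
    def __init__(self, frames=5, joints=21):
        super().__init__()
        self.joints = joints
        self.encoder = nn.Sequential(
            nn.Conv2d(frames, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, joints * 3)

    def forward(self, pressure):
        # pressure: (B, frames, H, W) stacked carpet readings
        z = self.encoder(pressure).flatten(1)
        return self.head(z).view(-1, self.joints, 3)   # 3D joint coordinates

pose = TactilePoseNet()(torch.rand(4, 5, 96, 96))      # shape (4, 21, 3)
```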
Wide-Baseline Relative Camera Pose Estimation with Directional Learning
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00327
Kefan Chen, Noah Snavely, A. Makadia
Modern deep learning techniques that regress the relative camera pose between two images have difficulty dealing with challenging scenarios, such as large camera motions that cause occlusions and significant changes in perspective, leaving little overlap between images. These models continue to struggle even with the benefit of large supervised training datasets. To address their limitations, we take inspiration from techniques showing that regressing keypoint locations in 2D and 3D can be improved by estimating a discrete distribution over keypoint locations. Analogously, in this paper we explore improving camera pose regression by instead predicting a discrete distribution over camera poses. To realize this idea, we introduce DirectionNet, which estimates discrete distributions over the 5D relative pose space using a novel parameterization that makes the estimation problem tractable. Specifically, DirectionNet factorizes the relative camera pose, specified by a 3D rotation and a translation direction, into a set of 3D direction vectors. Since 3D directions can be identified with points on the sphere, DirectionNet estimates discrete distributions on the sphere as its output. We evaluate our model on challenging synthetic and real pose estimation datasets constructed from Matterport3D and InteriorNet. Promising results show a nearly 50% reduction in error over direct regression methods.
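To illustrate why distributions on the sphere are convenient, the snippet below shows one way to turn a discrete distribution over sampled unit vectors into a single direction estimate (spherical expectation followed by normalization). This post-processing step is a plausible sketch, not necessarily DirectionNet's exact recipe.

```python
import torch

def expected_direction(prob, sphere_points):
    """Collapse a discrete distribution over sampled sphere points into one unit
    direction vector: weighted mean in R^3, then projection back to the sphere."""
    # prob: (N,) nonnegative weights summing to 1; sphere_points: (N, 3) unit vectors
    mean = (prob.unsqueeze(-1) * sphere_points).sum(dim=0)
    return mean / mean.norm().clamp_min(1e-8)

# Example: a distribution concentrated around the +z axis.
pts = torch.randn(1000, 3)
pts = pts / pts.norm(dim=-1, keepdim=True)
logits = 10.0 * pts[:, 2]                  # favor directions near +z
prob = torch.softmax(logits, dim=0)
print(expected_direction(prob, pts))       # close to (0, 0, 1)
```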
Clusformer: A Transformer based Clustering Approach to Unsupervised Large-scale Face and Visual Landmark Recognition
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01070
Xuan-Bac Nguyen, Duc Toan Bui, C. Duong, T. D. Bui, Khoa Luu
Research in automatic unsupervised visual clustering has received considerable attention over the last couple of years. It aims at explaining distributions of unlabeled visual images by clustering them via a parameterized model of appearance. Graph Convolutional Neural Networks (GCNs) have recently been among the most popular clustering methods, but they have some limitations. First, they are quite sensitive to hard or noisy samples. Second, they are hard to study with various deep network models because of their training cost. Finally, it is hard to design an end-to-end training model that couples the deep feature extraction and GCN clustering modules. This work therefore presents Clusformer, a simple but new Transformer-based approach to automatic visual clustering via its unsupervised attention mechanism. The proposed method robustly deals with noisy or hard samples. It is also flexible and effective in collaborating with different deep network models of various sizes in an end-to-end framework. The proposed method is evaluated on two popular large-scale visual databases, i.e. Google Landmark and the MS-Celeb-1M face database, and outperforms prior unsupervised clustering methods. Code will be available at https://github.com/VinAIResearch/Clusformer
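A hedged sketch of the kind of Transformer-based clustering head the abstract suggests: a self-attention encoder over the deep features of a sample's k nearest neighbors that predicts, for each neighbor, whether it belongs to the same cluster. The neighborhood size, widths, and the `NeighborhoodTransformer` name are assumptions.

```python
import torch
import torch.nn as nn

class NeighborhoodTransformer(nn.Module):
    """Score a sample's k nearest neighbors as same-cluster or not with a small
    Transformer encoder; an illustrative stand-in for the abstract's idea."""
    def __init__(self, dim=256, k=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)

    def forward(self, neigh_feats):
        # neigh_feats: (B, k, dim) deep features of each sample's k nearest neighbors
        h = self.encoder(neigh_feats)
        return torch.sigmoid(self.score(h)).squeeze(-1)   # (B, k) same-cluster prob

probs = NeighborhoodTransformer()(torch.randn(8, 16, 256))
```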
Transitional Adaptation of Pretrained Models for Visual Storytelling
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01247
Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, Gunhee Kim
Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from a discord between visual specificity and language fluency, since the two modules are often trained separately from large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pre-trained Model (TAPM) that adapts the multi-modal modules to each other using a simpler alignment task between visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models for sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
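To give a flavor of a text-free alignment objective, the snippet below implements an InfoNCE-style loss that pulls adjacent clips of the same story together and pushes apart clips from other stories in the batch. This is a stand-in under stated assumptions, not the exact TAPM adaptation loss.

```python
import torch
import torch.nn.functional as F

def sequential_alignment_loss(clip_feats, temperature=0.1):
    """Contrastive alignment over visual inputs only: each clip's feature should
    match the next clip of the same story rather than clips from other stories."""
    # clip_feats: (B, T, D) projected visual features of T ordered clips per story
    d = clip_feats.size(-1)
    anchors = F.normalize(clip_feats[:, :-1].reshape(-1, d), dim=-1)
    positives = F.normalize(clip_feats[:, 1:].reshape(-1, d), dim=-1)
    logits = anchors @ positives.t() / temperature
    targets = torch.arange(logits.size(0))      # the matching next clip is the positive
    return F.cross_entropy(logits, targets)

loss = sequential_alignment_loss(torch.randn(4, 5, 128))
```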
Learning to Identify Correct 2D-2D Line Correspondences on Sphere
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01157
Haoang Li, Kai Chen, Ji Zhao, Jiangliu Wang, Pyojin Kim, Zhe Liu, Yunhui Liu
Given a set of putative 2D-2D line correspondences, we aim to identify the correct matches. Existing methods exploit geometric constraints and are thus only applicable to structured scenes with orthogonality, parallelism, and coplanarity. In contrast, we propose the first approach suitable for both structured and unstructured scenes. Instead of geometric constraints, we leverage spatial regularity on the sphere. Specifically, we propose to map line correspondences to vectors tangent to the sphere. We use these vectors to encode both the angular and positional variations of image lines, which is more reliable and concise than directly using the inclinations, midpoints, or endpoints of image lines. Neighboring vectors mapped from correct matches exhibit a spatial regularity we call local trend consistency, regardless of the type of scene. To encode this regularity, we design a neural network and also propose a novel loss function that enforces a smoothness constraint on the vector field. In addition, we establish a large real-world dataset for image line matching. Experiments show that our approach outperforms state-of-the-art methods in terms of accuracy, efficiency, and robustness, and also generalizes well.
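The smoothness idea can be pictured as follows: tangent vectors mapped from correct matches should agree with their spatial neighbors. The k-NN neighbor indices and the plain squared-difference penalty below are illustrative assumptions standing in for the paper's loss function.

```python
import torch

def local_trend_consistency_loss(vectors, neighbor_idx):
    """Penalize disagreement between each tangent vector and its spatial neighbors,
    a toy version of a smoothness constraint on the vector field."""
    # vectors: (N, 3) tangent vectors on the sphere; neighbor_idx: (N, k) neighbor indices
    neigh = vectors[neighbor_idx]                 # (N, k, 3)
    diff = vectors.unsqueeze(1) - neigh           # deviation from each neighbor
    return (diff ** 2).sum(dim=-1).mean()

vecs = torch.randn(100, 3)
idx = torch.randint(0, 100, (100, 8))             # placeholder k-NN graph, k = 8
loss = local_trend_consistency_loss(vecs, idx)
```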
Debiased Subjective Assessment of Real-World Image Enhancement
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00077
Peibei Cao, Zhangyang Wang, Kede Ma
In real-world image enhancement, it is often challenging (if not impossible) to acquire ground-truth data, which prevents the adoption of distance metrics for objective quality assessment. As a result, one often resorts to subjective quality assessment, the most straightforward and reliable means of evaluating image enhancement. Conventional subjective testing requires manually pre-selecting a small set of visual examples, which may suffer from three sources of bias: 1) sampling bias, due to the extremely sparse distribution of the selected samples in the image space; 2) algorithmic bias, due to potential overfitting to the selected samples; and 3) subjective bias, due to potential cherry-picking of test results. This eventually makes the field of real-world image enhancement more of an art than a science. Here we take steps towards debiasing conventional subjective assessment by automatically sampling a set of adaptive and diverse images for subsequent testing. This is achieved by casting sample selection as a joint maximization of the discrepancy between the enhancers and the diversity among the selected input images. Careful visual inspection of the resulting enhanced images provides a debiased ranking of the enhancement algorithms. We demonstrate our subjective assessment method on three popular and practically demanding image enhancement tasks: dehazing, super-resolution, and low-light enhancement.
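One way to picture the sampling objective is a greedy selection that trades off how much the enhancers disagree on an image against how different the image is from those already selected. The greedy rule, the `lam` weighting, and the embedding-distance diversity measure below are assumptions, not the paper's optimization.

```python
import numpy as np

def select_test_images(discrepancy, features, m=10, lam=1.0):
    """Greedily pick m candidates that maximize enhancer disagreement plus
    diversity relative to the already selected set."""
    # discrepancy: (N,) disagreement score per candidate; features: (N, D) image embeddings
    selected = []
    for _ in range(m):
        if selected:
            d2 = ((features[:, None] - features[selected]) ** 2).sum(-1)
            diversity = d2.min(axis=1)            # distance to the nearest selected image
        else:
            diversity = np.zeros(len(features))
        score = discrepancy + lam * diversity
        score[selected] = -np.inf                 # never pick the same image twice
        selected.append(int(score.argmax()))
    return selected

picks = select_test_images(np.random.rand(200), np.random.rand(200, 32))
```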
Information Bottleneck Disentanglement for Identity Swapping
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.00341
Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, R. He
Improving the performance of face forgery detectors often requires more identity-swapped images of higher quality. One core objective of identity swapping is to generate identity-discriminative faces that are distinct from the target while identical to the source. To this end, properly disentangling identity and identity-irrelevant information is critical and remains a challenging endeavor. In this work, we propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck trade-off, i.e., finding an optimal compression of the pre-trained latent features. Moreover, a novel identity contrastive loss is proposed for further disentanglement by requiring a proper distance between the generated identity and the target. While most prior works have focused on using various loss functions to implicitly guide the learning of representations, we demonstrate that our model can provide explicit supervision for learning disentangled representations, achieving impressive performance in generating more identity-discriminative swapped faces.
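As a rough illustration of the identity contrastive idea, the snippet below penalizes a swapped face whose identity embedding drifts from the source or stays too close to the target. The cosine-similarity margin form is an assumed stand-in for the paper's loss, not its exact formula.

```python
import torch
import torch.nn.functional as F

def identity_contrastive_loss(id_swap, id_source, id_target, margin=0.5):
    """Pull the swapped face's identity embedding toward the source while keeping
    at least `margin` of cosine distance from the target identity."""
    # id_*: (B, D) identity embeddings from a pre-trained face recognition model
    sim_src = F.cosine_similarity(id_swap, id_source, dim=-1)
    sim_tgt = F.cosine_similarity(id_swap, id_target, dim=-1)
    return ((1.0 - sim_src) + F.relu(sim_tgt - (1.0 - margin))).mean()

loss = identity_contrastive_loss(
    torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
)
```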
Nighttime Visibility Enhancement by Increasing the Dynamic Range and Suppression of Light Effects
Pub Date: 2021-06-01 | DOI: 10.1109/CVPR46437.2021.01180
Aashish Sharma, R. Tan
Most existing nighttime visibility enhancement methods focus on low light. Night images, however, suffer not only from low light but also from man-made light effects such as glow, glare, and floodlights. Hence, when existing nighttime visibility enhancement methods are applied to these images, they intensify these effects, degrading the visibility even further. High dynamic range (HDR) imaging methods can address the low-light and over-exposed regions, but they cannot remove the light effects and thus cannot enhance the visibility in the affected regions. In this paper, given a single nighttime image as input, our goal is to enhance its visibility by increasing the dynamic range of the intensity, boosting the intensity of the low-light regions while simultaneously suppressing the light effects (glow, glare). First, we use a network to estimate the camera response function (CRF) from the input image and linearise the image. Second, we decompose the linearised image into low-frequency (LF) and high-frequency (HF) feature maps, which are processed separately by two networks for light-effects suppression and noise removal, respectively. Third, we use a network to increase the dynamic range of the processed LF feature maps, which are then combined with the processed HF feature maps to generate the final output with increased dynamic range and suppressed light effects. Our experiments show the effectiveness of our method in comparison with state-of-the-art nighttime visibility enhancement methods.
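The second step's low/high-frequency split can be pictured with a fixed Gaussian decomposition, as sketched below; the paper processes learned LF/HF feature maps with dedicated networks, so this hand-crafted filter is only an illustrative stand-in.

```python
import torch
import torch.nn.functional as F

def decompose_lf_hf(img, kernel_size=15, sigma=4.0):
    """Split a linearised image into a low-frequency base (Gaussian blur) and a
    high-frequency residual; a toy surrogate for the learned decomposition."""
    # img: (B, C, H, W) linearised image
    xs = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-(xs ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    k2d = g[:, None] * g[None, :]                        # separable Gaussian kernel
    weight = k2d.expand(img.size(1), 1, kernel_size, kernel_size).contiguous()
    lf = F.conv2d(img, weight, padding=kernel_size // 2, groups=img.size(1))
    hf = img - lf                                        # edges, texture, and noise
    return lf, hf

lf, hf = decompose_lf_hf(torch.rand(1, 3, 64, 64))
```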