Efficient Counterfactual Debiasing for Visual Question Answering
Camila Kolling, Martin D. Móre, Nathan Gavenski, E. Pooch, Otávio Parraga, Rodrigo C. Barros
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00263
Despite the success of neural architectures for Visual Question Answering (VQA), several recent studies have shown that VQA models are mostly driven by superficial correlations that are learned by exploiting undesired priors within training datasets. They often lack sufficient image grounding or tend to overly rely on textual information, failing to capture knowledge from the images. This affects their generalization to test sets with slight changes in the distribution of facts. To address this issue, some bias mitigation methods have relied on new training procedures that are capable of synthesizing counterfactual samples by masking critical objects within the images and words within the questions, while also changing the corresponding ground truth. We propose a novel model-agnostic counterfactual training procedure, namely Efficient Counterfactual Debiasing (ECD), in which we introduce a new negative answer-assignment mechanism that exploits the probability distribution of the answers based on their frequencies, as well as an improved counterfactual sample synthesizer. Our experiments demonstrate that ECD is a simple, computationally efficient counterfactual sample-synthesizer training procedure that establishes itself as the new state of the art for unbiased VQA.
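A minimal sketch of one plausible frequency-based negative answer assignment of the kind the abstract describes; this is an illustration, not the authors' implementation, and the function and parameter names are assumptions.

```python
# Sketch: assign a negative answer to a counterfactual sample by sampling from
# the empirical answer-frequency distribution, excluding the original ground truth.
import numpy as np

def assign_negative_answer(answer_counts: dict, ground_truth: str, rng=None) -> str:
    """Sample a replacement answer with probability proportional to its
    training-set frequency, never returning the original ground truth."""
    rng = rng or np.random.default_rng()
    candidates = [a for a in answer_counts if a != ground_truth]
    freqs = np.array([answer_counts[a] for a in candidates], dtype=float)
    probs = freqs / freqs.sum()          # frequency-based probability distribution
    return str(rng.choice(candidates, p=probs))

# Example: the counterfactual sample originally labelled "red" receives a
# frequent but different answer as its new (negative) label.
counts = {"yes": 5000, "no": 4800, "red": 900, "blue": 700, "2": 1200}
print(assign_negative_answer(counts, ground_truth="red"))
```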
{"title":"Efficient Counterfactual Debiasing for Visual Question Answering","authors":"Camila Kolling, Martin D. Móre, Nathan Gavenski, E. Pooch, Otávio Parraga, Rodrigo C. Barros","doi":"10.1109/WACV51458.2022.00263","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00263","url":null,"abstract":"Despite the success of neural architectures for Visual Question Answering (VQA), several recent studies have shown that VQA models are mostly driven by superficial correlations that are learned by exploiting undesired priors within training datasets. They often lack sufficient image grounding or tend to overly-rely on textual information, failing to capture knowledge from the images. This affects their generalization to test sets with slight changes in the distribution of facts. To address such an issue, some bias mitigation methods have relied on new training procedures that are capable of synthesizing counterfactual samples by masking critical objects within the images, and words within the questions, while also changing the corresponding ground truth. We propose a novel model-agnostic counterfactual training procedure, namely Efficient Counterfactual Debiasing (ECD), in which we introduce a new negative answer-assignment mechanism that exploits the probability distribution of the answers based on their frequencies, as well as an improved counterfactual sample synthesizer. Our experiments demonstrate that ECD is a simple, computationally-efficient counterfactual sample-synthesizer training procedure that establishes itself as the new state of the art for unbiased VQA.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114830917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Black-Box Explanation for Face Verification
D. Mery, Bernardita Morris
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00126
Given a facial matcher, the task in explainable face verification is to answer how relevant the different parts of a probe image are to establishing the match with an enrolled image. In many cases, however, the trained models cannot be manipulated and must be treated as "black boxes". In this paper, we present six different saliency maps that can be used to explain any face verification algorithm with no manipulation inside the face recognition model. The key idea of the methods is based on how the matching score of the two face images changes when the probe is perturbed. The proposed methods remove and aggregate different parts of the face, and measure the contributions of these parts both individually and in combination. We test and compare our proposed methods in three different scenarios: synthetic images with different qualities and occlusions; real face images with different facial expressions, poses, and occlusions; and faces from different demographic groups. In our experiments, five different face verification algorithms are used: ArcFace, Dlib, FaceNet (trained on VGGFace2 and CASIA-WebFace), and LBP. We conclude that one of the proposed methods achieves saliency maps that are stable and interpretable to humans. In addition, our method, in combination with a new contour-based visualization of saliency maps, shows promising results in comparison with other state-of-the-art methods. This paper thus provides insight into any face verification algorithm, making clear which face areas the algorithm relies on most to carry out the recognition process.
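A minimal occlusion-sensitivity sketch in the spirit of this approach, assuming the matcher is an opaque scoring function: saliency is estimated purely from how the black-box matching score changes when parts of the probe are removed. It is not one of the paper's six specific methods.

```python
import numpy as np

def occlusion_saliency(probe, enrolled, match_score, patch=16, stride=8):
    """probe, enrolled: HxWx3 float arrays; match_score: black-box function
    returning a similarity score for a pair of face images."""
    h, w = probe.shape[:2]
    base = match_score(probe, enrolled)
    saliency = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = probe.copy()
            occluded[y:y + patch, x:x + patch] = 0.0      # remove one face region
            drop = base - match_score(occluded, enrolled)  # score change caused by removal
            saliency[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return saliency / np.maximum(counts, 1)               # average contribution per pixel
```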
{"title":"On Black-Box Explanation for Face Verification","authors":"D. Mery, Bernardita Morris","doi":"10.1109/WACV51458.2022.00126","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00126","url":null,"abstract":"Given a facial matcher, in explainable face verification, the task is to answer: how relevant are the parts of a probe image to establish the matching with an enrolled image. In many cases, however, the trained models cannot be manipulated and must be treated as \"black-boxes\". In this paper, we present six different saliency maps that can be used to explain any face verification algorithm with no manipulation inside of the face recognition model. The key idea of the methods is based on how the matching score of the two face images changes when the probe is perturbed. The proposed methods remove and aggregate different parts of the face, and measure contributions of these parts individually and in-collaboration as well. We test and compare our proposed methods in three different scenarios: synthetic images with different qualities and occlusions, real face images with different facial expressions, poses, and occlusions and faces from different demographic groups. In our experiments, five different face verification algorithms are used: ArcFace, Dlib, FaceNet (trained on VGGface2 and CasiaWebFace), and LBP. We conclude that one of the proposed methods achieves saliency maps that are stable and interpretable to humans. In addition, our method, in combination with a new visualization of saliency maps based on contours, shows promising results in comparison with other state-of-the-art art methods. This paper presents good insights into any face verification algorithm, in which it can be clearly appreciated which are the most relevant face areas that an algorithm takes into account to carry out the recognition process.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134150962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D Modeling Beneath Ground: Plant Root Detection and Reconstruction Based on Ground-Penetrating Radar
Yawen Lu, G. Lu
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00077
3D object reconstruction based on deep neural networks has been gaining attention in recent years. However, recovering the 3D shapes of hidden and buried objects remains a challenge. Ground Penetrating Radar (GPR) is among the most powerful and widely used instruments for detecting and locating underground objects such as plant roots and pipes, with affordable prices and continually evolving technology. This paper first proposes a deep convolutional neural network-based, anchor-free GPR curve signal detection network that utilizes B-scans from a GPR sensor. The detection results help obtain precisely fitted parabola curves. Furthermore, a graph neural network-based root shape reconstruction network is designed to progressively recover the geometry of the major taproot and then of the fine root branches. Our results on gprMax-simulated root data as well as real-world GPR data collected from apple orchards demonstrate the potential of the proposed framework as a new approach for fine-grained underground object shape reconstruction in a non-destructive way.
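A hypothetical post-processing sketch of the parabola-fitting step mentioned above: once the detector returns candidate (trace position, travel time) points of a buried-object reflection in a B-scan, a parabola can be fitted to them by least squares. The quadratic model and names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_parabola(x, t):
    """Fit t = a*(x - x0)^2 + t0 via the equivalent form t = A x^2 + B x + C."""
    A, B, C = np.polyfit(x, t, deg=2)
    x0 = -B / (2 * A)                 # horizontal position of the apex (object location)
    t0 = C - B ** 2 / (4 * A)         # travel time at the apex (related to burial depth)
    return A, x0, t0

# Example with synthetic detections along a GPR reflection curve.
x = np.linspace(-1.0, 1.0, 21)
t = 0.8 * (x - 0.1) ** 2 + 5.0 + np.random.normal(0, 0.01, x.size)
print(fit_parabola(x, t))
```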
{"title":"3D Modeling Beneath Ground: Plant Root Detection and Reconstruction Based on Ground-Penetrating Radar","authors":"Yawen Lu, G. Lu","doi":"10.1109/WACV51458.2022.00077","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00077","url":null,"abstract":"3D object reconstruction based on deep neural networks has been gaining attention in recent years. However, recovering 3D shapes of hidden and buried objects remains to be a challenge. Ground Penetrating Radar (GPR) is among the most powerful and widely used instruments for detecting and locating underground objects such as plant roots and pipes, with affordable prices and continually evolving technology. This paper first proposes a deep convolution neural network-based anchor-free GPR curve signal detection net- work utilizing B-scans from a GPR sensor. The detection results can help obtain precisely fitted parabola curves. Furthermore, a graph neural network-based root shape reconstruction network is designated in order to progressively recover major taproot and then fine root branches’ geometry. Our results on the gprMax simulated root data as well as the real-world GPR data collected from apple orchards demonstrate the potential of using the proposed framework as a new approach for fine-grained underground object shape reconstruction in a non-destructive way.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134034696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Tear Gas Canisters With Limited Training Data
Ashwin D. D’Cruz, Christopher Tegho, Sean Greaves, Lachlan Kermode
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00135
Human rights investigations often entail triaging large volumes of open-source images and video in order to find moments that are relevant to a given investigation and warrant further inspection. Manually searching online for instances of tear gas usage is laborious and time-consuming. In this paper, we study various object detection models for their potential use in the discovery and identification of tear gas canisters for human rights monitors. CNN-based object detection typically requires large volumes of training data, and prior to our work, an appropriate dataset of tear gas canisters did not exist. We benchmark methods for training object detectors using limited labelled data: we fine-tune different object detection models on the limited labelled data and compare their performance to a few-shot detector and to augmentation strategies using synthetic data. We provide a dataset for evaluating and training tear gas canister detectors and indicate how such detectors can be deployed in real-world contexts for investigating human rights violations. Our experiments show that various techniques can improve results, including fine-tuning state-of-the-art detectors, using few-shot detectors, and including synthetic data as part of the training set.
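A minimal fine-tuning sketch of the kind benchmarked here, using the standard torchvision recipe rather than the authors' exact setup: start from a COCO-pretrained Faster R-CNN and replace its box predictor with a two-class head (background plus canister) before training on the small labelled set. Assumes a recent torchvision with the `weights` API.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.005, momentum=0.9, weight_decay=0.0005,
)

def train_step(images, targets):
    """images: list of CHW tensors; targets: list of dicts with "boxes" and
    "labels", as expected by torchvision detection models."""
    model.train()
    loss_dict = model(images, targets)   # detection models return a dict of losses in train mode
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```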
{"title":"Detecting Tear Gas Canisters With Limited Training Data","authors":"Ashwin D. D’Cruz, Christopher Tegho, Sean Greaves, Lachlan Kermode","doi":"10.1109/WACV51458.2022.00135","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00135","url":null,"abstract":"Human rights investigations often entail triaging large volumes of open source images and video in order to find moments that are relevant to a given investigation and warrant further inspection. Searching for instances of tear gas usage online manually is laborious and time-consuming. In this paper, we study various object detection models for their potential use in the discovery and identification of tear gas canisters for human rights monitors. CNN based object detection typically requires large volumes of training data, and prior to our work, an appropriate dataset of tear gas canisters did not exist. We benchmark methods for training object detectors using limited labelled data: we fine-tune different object detection models on the limited labelled data and compare performance to a few shot detector and augmentation strategies using synthetic data. We provide a dataset for evaluating and training tear gas canister detectors and indicate how such detectors can be deployed in real-world contexts for investigating human rights violations. Our experiments show that various techniques can improve results, including fine-tuning state of the art detectors, using few shot detectors, and including synthetic data as part of the training set.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133648026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extracting Vignetting and Grain Filter Effects from Photos
A. Abdelhamed, Jonghwa Yim, Abhijith Punnappurath, Michael S. Brown, Jihwan Choe, Kihwan Kim
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00013
Most smartphones support the use of real-time camera filters to impart visual effects to captured images. Currently, such filters come preinstalled on-device or need to be downloaded and installed before use (e.g., Instagram filters). Recent work [24] proposed a method to extract a camera filter directly from an example photo to which a filter has already been applied. The work in [24] focused only on the color and tonal aspects of the underlying filter. In this paper, we introduce a method to extract two spatially varying effects commonly used by on-device camera filters, namely image vignetting and image grain. Specifically, we show how to extract the parameters for the vignetting and image grain present in an example image and replicate these effects as an on-device filter. We use lightweight CNNs to estimate the filter parameters and employ efficient techniques (isotropic Gaussian filters and simplex noise) for regenerating the filters. Our design achieves a reasonable trade-off between efficiency and realism. We show that our method can extract vignetting and image grain filters from stylized photos and replicate them on captured images more faithfully than color and style transfer methods. Our method is highly efficient and has already been deployed on millions of flagship smartphones.
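An illustrative sketch of regenerating the two effects once their parameters have been estimated. The isotropic Gaussian falloff mirrors the vignetting model named above; plain Gaussian noise stands in for the simplex-noise grain used in the paper, and all function and parameter names are assumptions.

```python
import numpy as np

def apply_vignette(img, sigma=0.6):
    """img: HxWx3 float array in [0, 1]; sigma controls the falloff from the centre."""
    h, w = img.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    # normalised squared distance from the image centre
    r2 = ((x - w / 2) / (w / 2)) ** 2 + ((y - h / 2) / (h / 2)) ** 2
    mask = np.exp(-r2 / (2 * sigma ** 2))              # isotropic Gaussian falloff
    return img * mask[..., None]

def apply_grain(img, strength=0.04, rng=None):
    """Add per-pixel luminance grain of the given strength."""
    rng = rng or np.random.default_rng(0)
    grain = rng.normal(0.0, strength, img.shape[:2])
    return np.clip(img + grain[..., None], 0.0, 1.0)

# Example: apply both effects to a flat grey test image.
filtered = apply_grain(apply_vignette(np.full((256, 256, 3), 0.5)))
```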
{"title":"Extracting Vignetting and Grain Filter Effects from Photos","authors":"A. Abdelhamed, Jonghwa Yim, Abhijith Punnappurath, Michael S. Brown, Jihwan Choe, Kihwan Kim","doi":"10.1109/WACV51458.2022.00013","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00013","url":null,"abstract":"Most smartphones support the use of real-time camera filters to impart visual effects to captured images. Currently, such filters come preinstalled on-device or need to be downloaded and installed before use (e.g., Instagram filters). Recent work [24] proposed a method to extract a camera filter directly from an example photo that has already had a filter applied. The work in [24] focused only on the color and tonal aspects of the underlying filter. In this paper, we introduce a method to extract two spatially varying effects commonly used by on-device camera filters—namely, image vignetting and image grain. Specifically, we show how to extract the parameters for vignetting and image grain present in an example image and replicate these effects as an on-device filter. We use lightweight CNNs to estimate the filter parameters and employ efficient techniques—isotropic Gaussian filters and simplex noise—for regenerating the filters. Our design achieves a reasonable trade-off between efficiency and realism. We show that our method can extract vignetting and image grain filters from stylized photos and replicate the filters on captured images more faithfully, as compared to color and style transfer methods. Our method is significantly efficient and has been already deployed to millions of flagship smartphones.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124915304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoESR: Blind Super-Resolution using Kernel-Aware Mixture of Experts
Mohammad Emad, Maurice Peemen, H. Corporaal
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00406
Modern deep learning super-resolution approaches have achieved remarkable performance when the low-resolution (LR) input is a high-resolution (HR) image degraded by a fixed, known kernel, i.e., kernel-specific super-resolution (SR). However, real images often vary in their degradation kernels, so a single kernel-specific SR approach does not often produce accurate HR results. Recently, degradation-aware networks have been introduced to generate blind SR results under unknown kernel conditions. They can restore images for multiple blur kernels, but they have to compromise on quality compared to their kernel-specific counterparts. To address this issue, we propose a novel blind SR method called Mixture of Experts Super-Resolution (MoESR), which uses different experts for different degradation kernels. A broad space of degradation kernels is covered by kernel-specific SR networks (experts). We present an accurate kernel prediction method (gating mechanism) that evaluates the sharpness of the images generated by the experts. Based on the predicted kernel, the most suited expert network is selected for the input image. Finally, we fine-tune the selected network on the test image itself to leverage the advantage of internal learning. Our experimental results on standard synthetic datasets and real images demonstrate that MoESR outperforms state-of-the-art methods both quantitatively and qualitatively. Especially for the challenging ×4 SR task, our PSNR improvement of 0.93 dB on the DIV2KRK dataset is substantial.
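A sketch of the gating idea under stated assumptions: every kernel-specific expert super-resolves the input and the expert whose output is sharpest is selected. The variance-of-Laplacian score is a common sharpness proxy used here for illustration; the paper's exact sharpness measure may differ.

```python
import numpy as np
from scipy.ndimage import laplace

def sharpness(img):
    """Variance of the Laplacian of a grayscale image (higher means sharper)."""
    return float(laplace(img.astype(np.float64)).var())

def select_expert(lr_image, experts):
    """experts: list of callables, each mapping an LR image to an SR image."""
    outputs = [expert(lr_image) for expert in experts]   # run every kernel-specific expert
    scores = [sharpness(out) for out in outputs]         # gate by output sharpness
    best = int(np.argmax(scores))
    return best, outputs[best]   # index of the chosen expert and its SR result
```

The chosen expert would then be fine-tuned on the test image itself, as described above, before producing the final output.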
{"title":"MoESR: Blind Super-Resolution using Kernel-Aware Mixture of Experts","authors":"Mohammad Emad, Maurice Peemen, H. Corporaal","doi":"10.1109/WACV51458.2022.00406","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00406","url":null,"abstract":"Modern deep learning super-resolution approaches have achieved remarkable performance where the low-resolution (LR) input is a degraded high-resolution (HR) image by a fixed known kernel i.e. kernel-specific super-resolution (SR). However, real images often vary in their degradation kernels, thus a single kernel-specific SR approach does not often produce accurate HR results. Recently, degradation-aware networks are introduced to generate blind SR results for unknown kernel conditions. They can restore images for multiple blur kernels. However, they have to compromise in quality compared to their kernel-specific counterparts. To address this issue, we propose a novel blind SR method called Mixture of Experts Super-Resolution (MoESR), which uses different experts for different degradation kernels. A broad space of degradation kernels is covered by kernel-specific SR networks (experts). We present an accurate kernel prediction method (gating mechanism) by evaluating the sharpness of images generated by experts. Based on the predicted kernel, our most suited expert network is selected for the input image. Finally, we fine-tune the selected network on the test image itself to leverage the advantage of internal learning. Our experimental results on standard synthetic datasets and real images demonstrate that MoESR outperforms state-of-the-art methods both quantitatively and qualitatively. Especially for the challenging ×4 SR task, our PSNR improvement of 0.93 dB on the DIV2KRK dataset is substantial1.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130423645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent to Latent: A Learned Mapper for Identity Preserving Editing of Multiple Face Attributes in StyleGAN-generated Images
Siavash Khodadadeh, S. Ghadar, Saeid Motiian, Wei-An Lin, Ladislau Bölöni, R. Kalarot
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00373
Several recent papers have introduced techniques to adjust the attributes of human faces generated by unconditional GANs such as StyleGAN. Despite efforts to disentangle the attributes, a request to change one attribute often triggers unwanted changes to other attributes as well. More importantly, in some cases a human observer would not recognize the edited face as belonging to the same person. We propose an approach in which a neural network takes as input the latent encoding of a face and the desired attribute changes, and outputs the latent-space encoding of the edited image. The network is trained offline on unlabeled data, with training labels generated by an off-the-shelf attribute classifier. The desired attribute changes and conservation laws, such as identity maintenance, are encoded in the training loss. The number of attributes the mapper can simultaneously modify is limited only by the attributes available to the classifier: we trained a network that handles 35 attributes, more than any previous approach. As no optimization is performed at deployment time, the computation time is negligible, allowing real-time attribute editing. Qualitative and quantitative comparisons with the current state of the art show that our method is better at conserving the identity of the face and restricting changes to the requested attributes.
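A minimal sketch of such a latent-to-latent mapper, illustrative rather than the authors' architecture: an MLP takes a latent code and the requested attribute changes and outputs an edited latent, and the training loss combines the classifier's attribute target with a term that keeps the edit close to the original code. Layer sizes, the 512-dimensional latent, and the loss weighting are assumptions; the 35-attribute count comes from the abstract.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, latent_dim=512, num_attributes=35, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_attributes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, w, attr_delta):
        # Predict a residual edit so that a zero requested change leaves w nearly intact.
        return w + self.net(torch.cat([w, attr_delta], dim=-1))

def training_loss(w_edit, w, attr_logits, attr_target, lambda_id=1.0):
    """attr_logits: off-the-shelf classifier applied to the image generated from w_edit."""
    attr_loss = nn.functional.binary_cross_entropy_with_logits(attr_logits, attr_target)
    identity_loss = (w_edit - w).pow(2).mean()   # simple proxy for identity preservation
    return attr_loss + lambda_id * identity_loss
```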
{"title":"Latent to Latent: A Learned Mapper for Identity Preserving Editing of Multiple Face Attributes in StyleGAN-generated Images","authors":"Siavash Khodadadeh, S. Ghadar, Saeid Motiian, Wei-An Lin, Ladislau Bölöni, R. Kalarot","doi":"10.1109/WACV51458.2022.00373","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00373","url":null,"abstract":"Several recent papers introduced techniques to adjust the attributes of human faces generated by unconditional GANs such as StyleGAN. Despite efforts to disentangle the attributes, a request to change one attribute often triggers unwanted changes to other attributes as well. More importantly, in some cases, a human observer would not recognize the edited face to belong to the same person. We propose an approach where a neural network takes as input the latent encoding of a face and the desired attribute changes and outputs the latent space encoding of the edited image. The network is trained offline using unsupervised data, with training labels generated by an off-the-shelf attribute classifier. The desired attribute changes and conservation laws, such as identity maintenance, are encoded in the training loss. The number of attributes the mapper can simultaneously modify is only limited by the attributes available to the classifier – we trained a network that handles 35 attributes, more than any previous approach. As no optimization is performed at deployment time, the computation time is negligible, allowing real-time attribute editing. Qualitative and quantitative comparisons with the current state-of-the-art show our method is better at conserving the identity of the face and restricting changes to the requested attributes.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129867628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIGNAV: Semantically-Informed GPS-Denied Navigation and Mapping in Visually-Degraded Environments
Alex Krasner, Mikhail Sizintsev, Abhinav Rajvanshi, Han-Pang Chiu, Niluthpol Chowdhury Mithun, Kevin Kaighn, Philip Miller, R. Villamil, S. Samarasekera
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00192
Understanding the perceived scene during navigation enables intelligent robot behaviors. Current vision-based semantic SLAM (Simultaneous Localization and Mapping) systems provide these capabilities. However, their performance decreases in visually-degraded environments, which are common in critical robotic applications such as search and rescue missions. In this paper, we present SIGNAV, a real-time semantic SLAM system designed to operate in perceptually-challenging situations. To improve robustness for navigation in dark environments, SIGNAV leverages a multi-sensor navigation architecture to fuse vision with additional sensing modalities, including an inertial measurement unit (IMU), LiDAR, and wheel odometry. A new 2.5D semantic segmentation method is also developed to combine images and LiDAR depth maps to generate semantic labels of 3D mapped points in real time. We demonstrate the navigation accuracy of SIGNAV in a variety of indoor environments under both normal lighting and dark conditions, and show that SIGNAV provides semantic scene understanding capabilities in visually-degraded environments. We also show the benefits of semantic information to SIGNAV's performance.
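A sketch of the 2.5D input preparation under stated assumptions: LiDAR points are projected into the camera with known intrinsics and extrinsics to form a sparse depth map, which is stacked with the RGB image as an extra channel for the segmentation network. The function and variable names are illustrative, not SIGNAV's actual interfaces.

```python
import numpy as np

def lidar_to_depth_map(points_lidar, T_cam_lidar, K, h, w):
    """points_lidar: Nx3 points; T_cam_lidar: 4x4 extrinsics; K: 3x3 intrinsics."""
    pts = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    cam = (T_cam_lidar @ pts.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                       # keep points in front of the camera
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    depth = np.zeros((h, w), dtype=np.float32)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[ok], u[ok]] = cam[ok, 2]               # sparse depth map in metres
    return depth

def make_rgbd_input(image, depth):
    """Stack RGB (HxWx3, float) and depth (HxW) into a 4-channel segmentation input."""
    return np.concatenate([image, depth[..., None]], axis=-1)
```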
{"title":"SIGNAV: Semantically-Informed GPS-Denied Navigation and Mapping in Visually-Degraded Environments","authors":"Alex Krasner, Mikhail Sizintsev, Abhinav Rajvanshi, Han-Pang Chiu, Niluthpol Chowdhury Mithun, Kevin Kaighn, Philip Miller, R. Villamil, S. Samarasekera","doi":"10.1109/WACV51458.2022.00192","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00192","url":null,"abstract":"Understanding the perceived scene during navigation enables intelligent robot behaviors. Current vision-based semantic SLAM (Simultaneous Localization and Mapping) systems provide these capabilities. However, their performance decreases in visually-degraded environments, that are common places for critical robotic applications, such as search and rescue missions. In this paper, we present SIGNAV, a real-time semantic SLAM system to operate in perceptually-challenging situations. To improve the robustness for navigation in dark environments, SIGNAV leverages a multi-sensor navigation architecture to fuse vision with additional sensing modalities, including an inertial measurement unit (IMU), LiDAR, and wheel odometry. A new 2.5D semantic segmentation method is also developed to combine both images and LiDAR depth maps to generate semantic labels of 3D mapped points in real time. We demonstrate that the navigation accuracy from SIGNAV in a variety of indoor environments under both normal lighting and dark conditions. SIGNAV also provides semantic scene understanding capabilities in visually-degraded environments. We also show the benefits of semantic information to SIGNAV’s performance.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126349083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information Bottlenecked Variational Autoencoder for Disentangled 3D Facial Expression Modelling
Hao Sun, Nick E. Pears, Yajie Gu
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00239
Learning a disentangled representation is essential for building 3D face models that accurately capture identity and expression. We propose a novel variational autoencoder (VAE) framework to disentangle identity and expression from 3D input faces that exhibit a wide variety of expressions. Specifically, we design a system with two decoders: one for neutral-expression faces (i.e. identity-only faces) and one for the original (expressive) input faces. Crucially, an additional mutual-information regulariser is applied to the identity part to address the imbalance of information between the expressive input faces and the reconstructed neutral faces. Our evaluations on two public datasets (CoMA and BU-3DFE) show that this model achieves competitive results on the 3D face reconstruction task and state-of-the-art results on identity-expression disentanglement. We also show that, by upgrading to a conditional VAE, we obtain a system that generates different levels of expression from semantically meaningful variables.
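A conceptual sketch of the two-decoder design, with placeholder layer sizes and a simple MLP encoder as assumptions: one slice of the latent carries identity and is decoded to a neutral face, while the full latent (identity plus expression) is decoded to the expressive input face. The mutual-information regulariser is omitted here.

```python
import torch
import torch.nn as nn

class TwoDecoderVAE(nn.Module):
    def __init__(self, n_verts=3000, id_dim=64, exp_dim=16, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_verts * 3, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, id_dim + exp_dim)
        self.to_logvar = nn.Linear(hidden, id_dim + exp_dim)
        self.neutral_decoder = nn.Sequential(nn.Linear(id_dim, hidden), nn.ReLU(),
                                             nn.Linear(hidden, n_verts * 3))
        self.expr_decoder = nn.Sequential(nn.Linear(id_dim + exp_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, n_verts * 3))
        self.id_dim = id_dim

    def forward(self, x):
        h = self.encoder(x.flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        z_id = z[:, :self.id_dim]                                  # identity factor only
        neutral = self.neutral_decoder(z_id)                       # identity-only (neutral) face
        expressive = self.expr_decoder(z)                          # original expressive face
        return neutral, expressive, mu, logvar
```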
{"title":"Information Bottlenecked Variational Autoencoder for Disentangled 3D Facial Expression Modelling","authors":"Hao Sun, Nick E. Pears, Yajie Gu","doi":"10.1109/WACV51458.2022.00239","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00239","url":null,"abstract":"Learning a disentangled representation is essential to build 3D face models that accurately capture identity and expression. We propose a novel variational autoencoder (VAE) framework to disentangle identity and expression from 3D input faces that have a wide variety of expressions. Specifically, we design a system that has two decoders: one for neutral-expression faces (i.e. identity-only faces) and one for the original (expressive) input faces respectively. Crucially, we have an additional mutual-information regulariser applied on the identity part to solve the issue of imbalanced information over the expressive input faces and the reconstructed neutral faces. Our evaluations on two public datasets (CoMA and BU-3DFE) show that this model achieves competitive results on the 3D face reconstruction task and state-of-the-art results on identity-expression disentanglement. We also show that by updating to a conditional VAE, we have a system that generates different levels of expressions from semantically meaningful variables.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124056310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To miss-attend is to misalign! Residual Self-Attentive Feature Alignment for Adapting Object Detectors
Vaishnavi Khindkar, Chetan Arora, V. Balasubramanian, A. Subramanian, Rohit Saluja, C.V. Jawahar
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00045
Advancements in adaptive object detection can lead to tremendous improvements in applications like autonomous navigation, as they alleviate distributional shifts along the detection pipeline. Prior works adopt adversarial learning to align image features at global and local levels, yet instance-specific misalignment persists. Adaptive object detection also remains challenging due to the visual diversity of background scenes and intricate combinations of objects. Motivated by structural importance, we aim to attend to prominent instance-specific regions, overcoming the feature-misalignment issue. We propose a novel resIduaL seLf-attentive featUre alignMEnt (ILLUME) method for adaptive object detection. ILLUME comprises a Self-Attention Feature Map (SAFM) module that enhances structural attention to object-related regions and thereby generates domain-invariant features. Our approach significantly reduces the domain distance through improved feature alignment of the instances. Qualitative results demonstrate the ability of ILLUME to attend to the important object instances required for alignment. Experimental results on several benchmark datasets show that our method outperforms existing state-of-the-art approaches.
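An illustrative sketch of a residual self-attention block over backbone feature maps, in the spirit of the SAFM module; the layer names, channel reduction factor, and learnable residual weight are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)        # B x HW x C'
        k = self.key(x).flatten(2)                           # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)                  # B x HW x HW spatial attention
        v = self.value(x).flatten(2).transpose(1, 2)         # B x HW x C
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.gamma * out                          # residual self-attentive features
```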
{"title":"To miss-attend is to misalign! Residual Self-Attentive Feature Alignment for Adapting Object Detectors","authors":"Vaishnavi Khindkar, Chetan Arora, V. Balasubramanian, A. Subramanian, Rohit Saluja, C.V. Jawahar","doi":"10.1109/WACV51458.2022.00045","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00045","url":null,"abstract":"Advancements in adaptive object detection can lead to tremendous improvements in applications like autonomous navigation, as they alleviate the distributional shifts along the detection pipeline. Prior works adopt adversarial learning to align image features at global and local levels, yet the instance-specific misalignment persists. Also, adaptive object detection remains challenging due to visual diversity in background scenes and intricate combinations of objects. Motivated by structural importance, we aim to attend prominent instance-specific regions, overcoming the feature misalignment issue. We propose a novel resIduaL seLf-attentive featUre alignMEnt (ILLUME) method for adaptive object detection. ILLUME comprises Self-Attention Feature Map (SAFM) module that enhances structural attention to object-related regions and thereby generates domain invariant features. Our approach significantly reduces the domain distance with the improved feature alignment of the instances. Qualitative results demonstrate the ability of ILLUME to attend important object instances required for alignment. Experimental results on several benchmark datasets show that our method outperforms the existing state-of-the-art approaches.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"22 6S 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115946073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}