Distinguishing Unseen from Seen for Generalized Zero-shot Learning
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.00773
Hongzu Su, Jingjing Li, Zhi Chen, Lei Zhu, Ke Lu
Generalized zero-shot learning (GZSL) aims to recognize samples whose categories may not have been seen during training. Recognizing unseen classes as seen ones, or vice versa, often leads to poor performance in GZSL. Therefore, distinguishing seen and unseen domains is naturally an effective yet challenging solution for GZSL. In this paper, we present a novel method which leverages both visual and semantic modalities to distinguish seen and unseen categories. Specifically, our method deploys two variational autoencoders to generate latent representations for the visual and semantic modalities in a shared latent space, in which we align the latent representations of both modalities by Wasserstein distance and reconstruct each modality from the representations of the other. In order to learn a clearer boundary between seen and unseen classes, we propose a two-stage training strategy which takes advantage of seen and unseen semantic descriptions and searches for a threshold to separate seen and unseen visual samples. Finally, a seen expert and an unseen expert are used for the final classification. Extensive experiments on five widely used benchmarks verify that the proposed method significantly improves the results of GZSL. For instance, our method correctly recognizes more than 99% of samples when separating domains and improves the final classification accuracy from 72.6% to 82.9% on AWA1.
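To make the latent-alignment term concrete, the sketch below computes the closed-form squared 2-Wasserstein distance between two diagonal Gaussian posteriors, the kind of quantity a Wasserstein alignment loss between visual and semantic VAE encoders could use. It is only an illustrative sketch, not the authors' implementation; all tensor shapes and variable names (mu_v, logvar_v, ...) are assumptions.

```python
# Minimal sketch: closed-form squared 2-Wasserstein distance between two
# diagonal Gaussians, usable as an alignment loss between VAE posteriors.
# Shapes and names are assumptions, not the paper's code.
import torch

def w2_diag_gaussian(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between N(mu1, diag(var1)) and N(mu2, diag(var2))."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    mean_term = (mu1 - mu2).pow(2).sum(dim=1)
    cov_term = (var1.sqrt() - var2.sqrt()).pow(2).sum(dim=1)
    return mean_term + cov_term

# Toy usage: align a batch of visual and semantic latent posteriors.
mu_v, logvar_v = torch.randn(8, 64), torch.zeros(8, 64)
mu_s, logvar_s = torch.randn(8, 64), torch.zeros(8, 64)
align_loss = w2_diag_gaussian(mu_v, logvar_v, mu_s, logvar_s).mean()
print(align_loss.item())
```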
{"title":"Distinguishing Unseen from Seen for Generalized Zero-shot Learning","authors":"Hongzu Su, Jingjing Li, Zhi Chen, Lei Zhu, Ke Lu","doi":"10.1109/CVPR52688.2022.00773","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00773","url":null,"abstract":"Generalized zero-shot learning (GZSL) aims to recognize samples whose categories may not have been seen at training. Recognizing unseen classes as seen ones or vice versa often leads to poor performance in GZSL. Therefore, distinguishing seen and unseen domains is naturally an effective yet challenging solution for GZSL. In this paper, we present a novel method which leverages both visual and semantic modalities to distinguish seen and unseen categories. Specifically, our method deploys two variational autoencoders to generate latent representations for visual and semantic modalities in a shared latent space, in which we align latent representations of both modalities by Wasserstein distance and reconstruct two modalities with the representations of each other. In order to learn a clearer boundary between seen and unseen classes, we propose a two-stage training strategy which takes advantage of seen and unseen semantic descriptions and searches a threshold to separate seen and unseen visual samples. At last, a seen expert and an unseen expert are used for final classification. Extensive experiments on five widely used benchmarks verify that the proposed method can significantly improve the results of GZSL. For instance, our method correctly recognizes more than 99% samples when separating domains and improves the final classification accuracy from 72.6% to 82.9% on AWA1.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"756 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114003652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.01263
D. Kang, Hyeonjoong Jang, Jungeon Lee, C. Kyung, Min H. Kim
Omnidirectional cameras have been widely used to better understand surrounding environments. They are often configured as a stereo pair to estimate depth. However, due to the optics of the fisheye lens, conventional epipolar geometry is not directly applicable to omnidirectional camera images. Intermediate formats of omnidirectional images, such as equirectangular images, have been used. However, stereo matching performance on these image formats has been lower than that of conventional stereo due to severe image distortion near the pole regions. In this paper, to address the distortion problem of omnidirectional images, we devise a novel subdivision scheme of a spherical geodesic grid. This enables more isotropic patch sampling of spherical image information in the omnidirectional camera space. By extending the existing equal-arc scheme, our spherical geodesic grid is tessellated with an equal-epiline subdivision scheme, making the cell sizes and in-between distances as uniform as possible, i.e., the arc lengths of the spherical grid cells' edges are well regularized. Also, our uniformly tessellated coordinates in a 2D image can be transformed into spherical coordinates via a one-to-one mapping, allowing for analytical forward/backward transformation. Our uniform tessellation scheme achieves higher stereo-matching accuracy than the traditional cylindrical and cubemap-based approaches, while reducing the memory footprint required for stereo matching by 20%.
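For readers unfamiliar with spherical geodesic grids, the sketch below builds one by the classic icosphere recipe: start from an icosahedron, split each triangle 1-to-4, and re-project new vertices onto the unit sphere. This is a generic equal-split construction used only to illustrate the data structure; it is not the equal-arc or equal-epiline tessellation proposed in the paper.

```python
# Generic geodesic (icosphere) subdivision in NumPy: each triangle is split
# 1-to-4 and new vertices are pushed back onto the unit sphere.
import numpy as np

def icosahedron():
    t = (1.0 + 5.0 ** 0.5) / 2.0
    verts = np.array([
        [-1, t, 0], [1, t, 0], [-1, -t, 0], [1, -t, 0],
        [0, -1, t], [0, 1, t], [0, -1, -t], [0, 1, -t],
        [t, 0, -1], [t, 0, 1], [-t, 0, -1], [-t, 0, 1],
    ], dtype=float)
    verts /= np.linalg.norm(verts, axis=1, keepdims=True)
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return verts, faces

def subdivide(verts, faces):
    verts = list(map(np.asarray, verts))
    cache = {}                      # midpoint cache so shared edges split only once
    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = verts[i] + verts[j]
            verts.append(m / np.linalg.norm(m))   # re-project onto the sphere
            cache[key] = len(verts) - 1
        return cache[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.vstack(verts), new_faces

v, f = icosahedron()
for _ in range(3):                   # 3 levels -> 1280 triangular cells
    v, f = subdivide(v, f)
print(len(v), "vertices,", len(f), "cells")
```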
{"title":"Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching","authors":"D. Kang, Hyeonjoong Jang, Jungeon Lee, C. Kyung, Min H. Kim","doi":"10.1109/CVPR52688.2022.01263","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01263","url":null,"abstract":"Omnidirectional cameras have been used widely to better understand surrounding environments. They are often configured as stereo to estimate depth. However, due to the optics of the fish eye lens, conventional epipolar geometry is inapplicable directly to omnidirectional camera images. Intermediate formats of omnidirectional images, such as equirect-angular images, have been used. However, stereo matching performance on these image formats has been lower than the conventional stereo due to severe image distortion near pole regions. In this paper, to address the distortion problem of omnidirectional images, we devise a novel subdivision scheme of a spherical geodesic grid. This enables more isotropic patch sampling of spherical image information in the omnidirectional camera space. By extending the existing equalarc scheme, our spherical geodesic grid is tessellated with an equalepiline subdivision scheme, making the cell sizes and in-between distances as uniform as possible, i.e., the arc length of the spherical grid cell's edges is well regularized. Also, our uniformly tessellated coordinates in a 2D image can be transformed into spherical coordinates via one-to-one mapping, allowing for analytical forward/backward transformation. Our uniform tessellation scheme achieves a higher accuracy of stereo matching than the traditional cylindrical and cubemap-based approaches, reducing the memory footage required for stereo matching by 20%.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114864919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.01725
Xinyu Zhou, Peiqi Duan, Yi Ma, Boxin Shi
This paper proposes to use neuromorphic events for correcting rolling shutter (RS) images into consecutive global shutter (GS) frames. The RS effect, caused by the row-wise read-out of CMOS sensors, introduces edge distortion and region occlusion into images. We introduce a novel computational imaging setup consisting of an RS sensor and an event sensor, and propose a neural network called EvUnroll to solve this problem by exploiting the high-temporal-resolution property of events. We use events to bridge a spatio-temporal connection between RS and GS, establish a flow estimation module to correct edge distortions, and design a synthesis-based restoration module to restore occluded regions. The results of the two branches are fused through a refining module to generate corrected GS images. We further propose datasets captured by a high-speed camera and an RS-Event hybrid camera system for training and testing our network. Experimental results on both public and proposed datasets show a systematic performance improvement compared to state-of-the-art methods.
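The correction hinges on the rolling-shutter timing model: each image row is exposed slightly later than the previous one, so each row can be associated with its own timestamp and its own slice of the event stream. The sketch below illustrates only this timing bookkeeping; the sensor parameters, the random stand-in event array, and the window size are assumptions, and none of this is the EvUnroll network itself.

```python
# Rolling-shutter timing sketch: per-row timestamps and the events near them.
import numpy as np

H = 480                              # number of image rows (assumed)
row_delay = 30e-6                    # read-out delay between consecutive rows in seconds (assumed)
row_time = np.arange(H) * row_delay  # capture timestamp of each row

# (x, y, t, polarity) tuples; random stand-ins for a real event stream
events = np.random.rand(10000, 4)
events[:, 2] *= row_time[-1]         # spread event times over the frame read-out

def events_for_row(row, half_window=1e-3):
    """Events whose timestamps fall close to a given row's capture time."""
    mask = np.abs(events[:, 2] - row_time[row]) < half_window
    return events[mask]

print(events_for_row(240).shape)
```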
{"title":"EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction","authors":"Xinyu Zhou, Peiqi Duan, Yi Ma, Boxin Shi","doi":"10.1109/CVPR52688.2022.01725","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01725","url":null,"abstract":"This paper proposes to use neuromorphic events for correcting rolling shutter (RS) images as consecutive global shutter (GS) frames. RS effect introduces edge distortion and region occlusion into images caused by row-wise read-out of CMOS sensors. We introduce a novel computational imaging setup consisting of an RS sensor and an event sensor, and propose a neural network called EvUnroll to solve this problem by exploring the high-temporal-resolution property of events. We use events to bridge a spatio-temporal connection between RS and GS, establish a flow estimation module to correct edge distortions, and design a synthesis-based restoration module to restore occluded regions. The results of two branches are fused through a refining module to generate corrected GS images. We further propose datasets captured by a high-speed camera and an RS-Event hybrid camera system for training and testing our network. Experimental results on both public and proposed datasets show a systematic performance improvement compared to state-of-the-art methods.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing Wasserstein-p Distance Between Images with Linear Cost
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.00060
Yidong Chen, Chen Li, Z. Lu
When images are formulated as discrete measures, computing the Wasserstein-p distance between them is challenging due to the complexity of solving the corresponding Kantorovich problem. In this paper, we propose a novel algorithm to compute the Wasserstein-p distance between discrete measures by restricting the optimal transport (OT) problem to a subset. First, we define the restricted OT problem and prove that its solution converges to Kantorovich's OT solution. Second, we propose the SparseSinkhorn algorithm for the restricted problem and provide a multi-scale algorithm to estimate the subset. Finally, we implement the proposed algorithm on CUDA and illustrate the linear computational cost in terms of both time and memory requirements. We compute the Wasserstein-p distance, estimate the transport mapping, and transfer color between color images with sizes ranging from $64\times 64$ to $1920\times 1200$. (Our code is available at https://github.com/ucascnic/CudaOT)
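For context, the sketch below is a minimal dense entropic Sinkhorn solver, the O(n²) baseline whose cost the restricted (SparseSinkhorn) formulation reduces. The regularization strength, iteration count, and toy 1D histograms are arbitrary choices for illustration, not values from the paper.

```python
# Standard dense entropic Sinkhorn iteration in NumPy (the baseline being accelerated).
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Entropic OT between histograms a, b with cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # transport plan
    return np.sum(P * C), P              # approximate W_p^p and the plan

# Toy example: two small 1D histograms with squared-distance cost (p = 2).
x = np.linspace(0, 1, 64)
a = np.exp(-(x - 0.3) ** 2 / 0.01); a /= a.sum()
b = np.exp(-(x - 0.7) ** 2 / 0.01); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2
cost, plan = sinkhorn(a, b, C)
print(cost ** 0.5)                       # roughly 0.4, the distance between the modes
```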
{"title":"Computing Wasserstein-$p$ Distance Between Images with Linear Cost","authors":"Yidong Chen, Chen Li, Z. Lu","doi":"10.1109/CVPR52688.2022.00060","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00060","url":null,"abstract":"When the images are formulated as discrete measures, computing Wasserstein-p distance between them is challenging due to the complexity of solving the corresponding Kantorovich's problem. In this paper, we propose a novel algorithm to compute the Wasserstein-p distance between discrete measures by restricting the optimal transport (OT) problem on a subset. First, we define the restricted OT problem and prove the solution of the restricted problem converges to Kantorovich's OT solution. Second, we propose the SparseSinkhorn algorithm for the restricted problem and provide a multi-scale algorithm to estimate the subset. Finally, we implement the proposed algorithm on CUDA and illustrate the linear computational cost in terms of time and memory requirements. We compute Wasserstein-p distance, estimate the transport mapping, and transfer color between color images with size ranges from $64times 64$ to $1920times 1200$. (Our code is available at https://github.com/ucascnic/CudaOT)","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126302626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Day-to-Night Image Synthesis for Training Nighttime Neural ISPs
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.01050
Abhijith Punnappurath, Abdullah Abuolaim, A. Abdelhamed, Alex Levinshtein, M. S. Brown
Many flagship smartphone cameras now use a dedicated neural image signal processor (ISP) to render noisy raw sensor images to the final processed output. Training night-mode ISP networks relies on large-scale datasets of image pairs with: (1) a noisy raw image captured with a short exposure and a high ISO gain; and (2) a ground-truth low-noise raw image captured with a long exposure and low ISO that has been rendered through the ISP. Capturing such image pairs is tedious and time-consuming, requiring careful setup to ensure alignment between the image pairs. In addition, ground-truth images are often prone to motion blur due to the long exposure. To address this problem, we propose a method that synthesizes nighttime images from daytime images. Daytime images are easy to capture, exhibit low noise (even on smartphone cameras), and rarely suffer from motion blur. We outline a processing framework to convert daytime raw images to have the appearance of realistic nighttime raw images with different levels of noise. Our procedure allows us to easily produce aligned noisy and clean nighttime image pairs. We show the effectiveness of our synthesis framework by training neural ISPs for night-mode rendering. Furthermore, we demonstrate that using our synthetic nighttime images together with small amounts of real data (e.g., 5% to 10%) yields performance almost on par with training exclusively on real nighttime images. Our dataset and code are available at https://github.com/SamsungLabs/day-to-night.
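A minimal sketch of the general idea, assuming the standard heteroscedastic raw-noise model (signal-dependent shot noise plus read noise): darken a clean daytime raw frame and corrupt it to obtain an aligned noisy/clean nighttime pair. The darkening factor and noise parameters below are made-up placeholders; the actual pipeline calibrates such quantities per camera and per ISO and does considerably more.

```python
# Sketch: synthesize an aligned noisy/clean nighttime raw pair from a daytime raw image.
import numpy as np

def day_to_night(raw, darkening=0.02, shot=0.012, read=0.002):
    """raw: clean normalized raw mosaic in [0, 1]; noise parameters are placeholders."""
    dark = raw * darkening                                # simulate low scene illumination
    var = shot * dark + read ** 2                         # per-pixel noise variance
    noisy = dark + np.random.normal(0.0, np.sqrt(var))    # signal-dependent Gaussian noise
    return np.clip(noisy, 0.0, 1.0), dark                 # noisy input, clean target

day_raw = np.random.rand(256, 256)                        # stand-in for a real capture
noisy_night, clean_night = day_to_night(day_raw)
print(noisy_night.shape, clean_night.mean())
```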
{"title":"Day-to-Night Image Synthesis for Training Nighttime Neural ISPs","authors":"Abhijith Punnappurath, Abdullah Abuolaim, A. Abdelhamed, Alex Levinshtein, M. S. Brown","doi":"10.1109/CVPR52688.2022.01050","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01050","url":null,"abstract":"Many flagship smartphone cameras now use a dedicated neural image signal processor (ISP) to render noisy raw sensor images to the final processed output. Training night-mode ISP networks relies on large-scale datasets of image pairs with: (1) a noisy raw image captured with a short exposure and a high ISO gain; and (2) a ground truth low-noise raw image captured with a long exposure and low ISO that has been rendered through the ISP. Capturing such image pairs is tedious and time-consuming, requiring careful setup to ensure alignment between the image pairs. In addition, ground truth images are often prone to motion blur due to the long exposure. To address this problem, we propose a method that synthesizes nighttime images from day-time images. Daytime images are easy to capture, exhibit low-noise (even on smartphone cameras) and rarely suffer from motion blur. We outline a processing framework to convert daytime raw images to have the appearance of realistic nighttime raw images with different levels of noise. Our procedure allows us to easily produce aligned noisy and clean nighttime image pairs. We show the effectiveness of our synthesis framework by training neural ISPs for nightmode rendering. Furthermore, we demonstrate that using our synthetic nighttime images together with small amounts of real data (e.g., 5% to 10%) yields performance almost on par with training exclusively on real nighttime images. Our dataset and code are available at https://github.com/SamsungLabs/day-to-night.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126413773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.00215
Biwen Lei, Xiefan Guo, Hongyu Yang, Miaomiao Cui, Xuansong Xie, Dihe Huang
Photo retouching finds many applications in various fields. However, most existing methods are designed for global retouching and seldom pay attention to local regions, even though local retouching is actually much more tedious and time-consuming in photography pipelines. In this paper, we propose a novel adaptive blend pyramid network, which aims to achieve fast local retouching on ultra high-resolution photos. The network is mainly composed of two components: a context-aware local retouching layer (LRL) and an adaptive blend pyramid layer (BPL). The LRL is designed to implement local retouching on low-resolution images, giving full consideration to the global context and local texture information, and the BPL is then developed to progressively expand the low-resolution results to higher resolutions with the help of the proposed adaptive blend module and refining module. Our method outperforms existing methods by a large margin on two local photo retouching tasks and exhibits excellent running speed, achieving real-time inference on 4K images with a single NVIDIA Tesla P100 GPU. Moreover, we introduce the first high-definition cloth retouching dataset, CRHD-3K, to promote research on local photo retouching. The dataset is available at https://github.com/youngLbw/crhd-3K.
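The sketch below illustrates the generic "edit at low resolution, blend the residual back at full resolution" pattern that pyramid-based retouching builds on. The nearest-neighbour resampling and the global example edit are simplifications of ours, not the LRL/BPL modules described above.

```python
# Sketch: apply an edit on a downscaled copy, upsample the residual, add it back.
import numpy as np

def downsample(img, s):
    return img[::s, ::s]

def upsample(img, s):
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def retouch_lowres(img):
    return np.clip(img * 1.2 + 0.05, 0.0, 1.0)    # stand-in for the learned local edit

def blend_pyramid_edit(full_res, scale=4):
    low = downsample(full_res, scale)
    residual = retouch_lowres(low) - low           # what the edit changed, at low resolution
    return np.clip(full_res + upsample(residual, scale), 0.0, 1.0)

photo = np.random.rand(1024, 1024, 3)              # stand-in for a high-resolution crop
out = blend_pyramid_edit(photo)
print(out.shape)
```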
{"title":"ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo","authors":"Biwen Lei, Xiefan Guo, Hongyu Yang, Miaomiao Cui, Xuansong Xie, Dihe Huang","doi":"10.1109/CVPR52688.2022.00215","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00215","url":null,"abstract":"Photo retouching finds many applications in various fields. However, most existing methods are designed for global retouching and seldom pay attention to the local region, while the latter is actually much more tedious and time-consuming in photography pipelines. In this paper, we propose a novel adaptive blend pyramid network, which aims to achieve fast local retouching on ultra high-resolution photos. The network is mainly composed of two components: a context-aware local retouching layer (LRL) and an adaptive blend pyramid layer (BPL). The LRL is designed to implement local retouching on low-resolution images, giving full consideration of the global context and local texture information, and the BPL is then developed to progressively expand the low-resolution results to the higher ones, with the help of the proposed adaptive blend module and refining module. Our method outperforms the existing methods by a large margin on two local photo retouching tasks and exhibits excellent performance in terms of running speed, achieving real-time inference on 4K images with a single NVIDIA Tesla P100 GPU. Moreover, we introduce the first high-definition cloth retouching dataset CRHD-3K to promote the research on local photo retouching. The dataset is available at https://github.com/youngLbw/crhd-3K.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125637011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMPL-A: Modeling Person-Specific Deformable Anatomy
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.02015
Hengtao Guo, Benjamin Planche, Meng Zheng, S. Karanam, Terrence Chen, Ziyan Wu
A variety of diagnostic and therapeutic protocols rely on locating in vivo target anatomical structures, which can be obtained from medical scans. However, organs move and deform as the patient changes pose. In order to obtain accurate target location information, clinicians have to either conduct frequent intraoperative scans, resulting in higher exposure of patients to radiation, or adopt proxy procedures (e.g., creating and using custom molds to keep patients in the exact same pose during both preoperative organ scanning and subsequent treatment). Such custom proxy methods are typically sub-optimal, constraining the clinicians and costing patients precious time and money. To the best of our knowledge, this work is the first to present a learning-based approach to estimate a patient's internal organ deformation for arbitrary human poses in order to assist with radiotherapy and similar medical protocols. The underlying method first leverages medical scans to learn a patient-specific representation that potentially encodes the organ's shape and elastic properties. During inference, given the patient's current body pose and the organ representation extracted from previous medical scans, our method can estimate the current organ deformation to offer guidance to clinicians. We conduct experiments on a well-sized dataset which is augmented from real clinical data using finite element modeling. Our results suggest that pose-dependent organ deformation can be learned through a point cloud autoencoder conditioned on the parametric pose input. We hope that this work can be a starting point for future research towards closing the loop between human mesh recovery and anatomical reconstruction, with applications beyond the medical domain.
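As a rough illustration of the model family the results point to, the sketch below defines a tiny point-cloud autoencoder whose decoder is conditioned on a parametric body-pose vector. The architecture, the 72-dimensional pose assumption, and all layer sizes are ours, not the authors' network.

```python
# Tiny pose-conditioned point-cloud autoencoder (illustrative only).
import torch
import torch.nn as nn

class PoseConditionedPCAE(nn.Module):
    def __init__(self, n_points=1024, latent=128, pose_dim=72):
        super().__init__()
        self.encoder = nn.Sequential(                  # per-point MLP followed by max pooling
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, latent, 1),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent + pose_dim, 512), nn.ReLU(),
            nn.Linear(512, n_points * 3),
        )
        self.n_points = n_points

    def forward(self, points, pose):
        # points: (B, N, 3) organ surface samples, pose: (B, pose_dim) body pose parameters
        feat = self.encoder(points.transpose(1, 2)).max(dim=2).values
        out = self.decoder(torch.cat([feat, pose], dim=1))
        return out.view(-1, self.n_points, 3)          # predicted deformed organ point cloud

model = PoseConditionedPCAE()
pred = model(torch.randn(2, 1024, 3), torch.randn(2, 72))
print(pred.shape)
```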
{"title":"SMPL-A: Modeling Person-Specific Deformable Anatomy","authors":"Hengtao Guo, Benjamin Planche, Meng Zheng, S. Karanam, Terrence Chen, Ziyan Wu","doi":"10.1109/CVPR52688.2022.02015","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.02015","url":null,"abstract":"A variety of diagnostic and therapeutic protocols rely on locating in vivo target anatomical structures, which can be obtained from medical scans. However, organs move and deform as the patient changes his/her pose. In order to obtain accurate target location information, clinicians have to either conduct frequent intraoperative scans, resulting in higher exposition of patients to radiations, or adopt proxy procedures (e.g., creating and using custom molds to keep patients in the exact same pose during both preoperative organ scanning and subsequent treatment. Such custom proxy methods are typically sub-optimal, constraining the clinicians and costing precious time and money to the patients. To the best of our knowledge, this work is the first to present a learning-based approach to estimate the patient's internal organ deformation for arbitrary human poses in order to assist with radiotherapy and similar medical protocols. The underlying method first leverages medical scans to learn a patient-specific representation that potentially encodes the organ's shape and elastic properties. During inference, given the patient's current body pose information and the organ's representation extracted from previous medical scans, our method can estimate their current organ deformation to offer guidance to clinicians. We conduct experiments on a well-sized dataset which is augmented through real clinical data using finite element modeling. Our results suggest that pose-dependent organ deformation can be learned through a point cloud autoencoder conditioned on the parametric pose input. We hope that this work can be a starting point for future research towards closing the loop between human mesh recovery and anatomical reconstruction, with applications beyond the medical domain.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122220277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.00491
Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, Si Liu
Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to the delayed and implicit spatial-temporal interaction occurring in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier, in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, the referring words and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding phase for further denoising and highlighting the spatial-temporal consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performance on four popular benchmarks, with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming around 7× less computational overhead (code: https://github.com/dzh19990407/LBDT).
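The sketch below shows a generic single-head cross-modal attention block in which referring-word features query a flattened visual feature map, i.e., the basic mechanism behind "language as an intermediary bridge". Feature dimensions and the single-head formulation are simplifications; this is not the LBDT or BCA module itself.

```python
# Generic cross-modal attention: language tokens gather information from a visual feature map.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from language tokens
        self.k = nn.Linear(dim, dim)   # keys from visual features
        self.v = nn.Linear(dim, dim)   # values from visual features
        self.scale = dim ** -0.5

    def forward(self, words, pixels):
        # words: (B, L, C) word embeddings, pixels: (B, HW, C) flattened feature map
        attn = torch.softmax(self.q(words) @ self.k(pixels).transpose(1, 2) * self.scale, dim=-1)
        return attn @ self.v(pixels)   # (B, L, C): language queries enriched with visual context

block = CrossModalAttention()
out = block(torch.randn(2, 10, 256), torch.randn(2, 32 * 32, 256))
print(out.shape)
```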
{"title":"Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation","authors":"Zihan Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Jizhong Han, Si Liu","doi":"10.1109/CVPR52688.2022.00491","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00491","url":null,"abstract":"Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate additional 2D ConvNets as encoders to extract mixed spatial-temporal features. However, these methods suffer from spatial misalignment or false distractors due to delayed and implicit spatial-temporal interaction occurring in the decoding phase. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module which utilizes language as an intermediary bridge to accomplish explicit and adaptive spatial-temporal interaction earlier in the encoding phase. Concretely, cross-modal attention is performed among the temporal encoder, referring words and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we also propose a Bilateral Channel Activation (BCA) module in the decoding phase for further denoising and highlighting the spatial-temporal consistent features via channel-wise activation. Extensive experiments show our method achieves new state-of-the-art performances on four popular benchmarks with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming around 7× less computational overhead11https://github.com/dzh19990407/LBDT.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121686797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Partial Class Activation Attention for Semantic Segmentation
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.01633
Sun'ao Liu, Hongtao Xie, Hai Xu, Yongdong Zhang, Qi Tian
Current attention-based methods for semantic segmentation mainly model pixel relations through pairwise affinity and coarse segmentation. For the first time, this paper explores modeling pixel relations via the Class Activation Map (CAM). Beyond the previous CAM generated from image-level classification, we present Partial CAM, which subdivides the task into region-level predictions and achieves better localization performance. In order to eliminate the intra-class inconsistency caused by variances of local context, we further propose Partial Class Activation Attention (PCAA), which simultaneously utilizes local and global class-level representations for attention calculation. Once the partial CAM is obtained, PCAA collects local class centers and computes the pixel-to-class relation locally. Applying local-specific representations ensures reliable results under different local contexts. To guarantee global consistency, we gather global representations from all local class centers and conduct feature aggregation. Experimental results confirm that Partial CAM outperforms the previous two strategies for modeling pixel relations. Notably, our method achieves state-of-the-art performance on several challenging benchmarks including Cityscapes, Pascal Context, and ADE20K. Code is available at https://github.com/lsa1997/PCAA.
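To ground the terminology, the sketch below computes (1) a class activation map by applying classification weights to a feature map, (2) class centers as CAM-weighted feature averages, and (3) a pixel-to-class relation map. This is the generic, global CAM recipe only; the partial (region-level) CAM and the local/global attention of PCAA are not reproduced here, and all shapes are arbitrary.

```python
# Generic CAM, class centers, and pixel-to-class relation (illustrative shapes).
import torch

B, C, H, W, K = 2, 256, 32, 32, 19            # batch, channels, spatial size, classes
feats = torch.randn(B, C, H, W)
cls_weight = torch.randn(K, C)                 # weights of a 1x1 classification head

cam = torch.einsum("kc,bchw->bkhw", cls_weight, feats)         # class activation maps
cam = torch.softmax(cam.flatten(2), dim=-1).view(B, K, H, W)   # normalize over spatial positions

# class centers: CAM-weighted average of features, one C-dim vector per class
centers = torch.einsum("bkhw,bchw->bkc", cam, feats)

# pixel-to-class relation: similarity between every pixel and every class center
relation = torch.einsum("bkc,bchw->bkhw", centers, feats)
print(relation.shape)                                          # torch.Size([2, 19, 32, 32])
```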
{"title":"Partial Class Activation Attention for Semantic Segmentation","authors":"Sun'ao Liu, Hongtao Xie, Hai Xu, Yongdong Zhang, Qi Tian","doi":"10.1109/CVPR52688.2022.01633","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01633","url":null,"abstract":"Current attention-based methods for semantic segmentation mainly model pixel relation through pairwise affinity and coarse segmentation. For the first time, this paper explores modeling pixel relation via Class Activation Map (CAM). Beyond the previous CAM generated from image-level classification, we present Partial CAM, which sub-divides the task into region-level prediction and achieves better localization performance. In order to eliminate the intra-class inconsistency caused by the variances of local context, we further propose Partial Class Activation Attention (PCAA) that simultaneously utilizes local and global class-level representations for attention calculation. Once obtained the partial CAM, PCAA collects local class centers and computes pixel-to-class relation locally. Applying local-specific representations ensures reliable results under different local contexts. To guarantee global consistency, we gather global representations from all local class centers and conduct feature aggregation. Experimental results confirm that Partial CAM outperforms the previous two strategies as pixel relation. Notably, our method achieves state-of-the-art performance on several challenging benchmarks including Cityscapes, Pascal Context, and ADE20K. Code is available at https://github.com/lsa1997/PCAA.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131932761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal LED Spectral Multiplexing for NIR2RGB Translation
Pub Date : 2022-06-01 | DOI: 10.1109/CVPR52688.2022.01232
Lei Liu, Yuze Chen, Junchi Yan, Yinqiang Zheng
The industry practice for night video surveillance is to use auxiliary near-infrared (NIR) LEDs, usually centered at 850nm or 940nm, for scene illumination. NIR LEDs are used to reduce power consumption while hiding the surveillance coverage area from naked human eyes. The captured images are almost monochromatic, and visual color and texture tend to disappear, which hinders both human and machine perception. A few existing studies have tried to convert such NIR images to RGB images through deep learning, but they can neither provide satisfying results nor generalize well beyond the training dataset. In this paper, we aim to break the fundamental restrictions on reliable NIR-to-RGB (NIR2RGB) translation by examining the imaging mechanism of single-chip silicon-based RGB cameras under NIR illumination, and propose to retrieve the optimal LED multiplexing via deep learning. Experimental results show that this translation task can be significantly improved by properly multiplexing NIR LEDs close to the visible spectral range rather than using 850nm and 940nm LEDs.
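The multiplexing search rests on a linear spectral imaging model: each raw RGB value is an integral over wavelength of LED spectral power, surface reflectance, and sensor sensitivity. The sketch below simulates that forward model with random stand-in spectra; real LED curves, calibrated camera sensitivities, and the optimization over the mixing weights are what the paper actually addresses.

```python
# Linear spectral imaging model under a weighted mix of LEDs (stand-in spectra).
import numpy as np

wl = np.arange(400, 1001, 10)                  # wavelength grid, 400-1000 nm in 10 nm steps
led_spectra = np.random.rand(4, wl.size)       # spectral power of 4 candidate LEDs (stand-ins)
reflectance = np.random.rand(wl.size)          # scene surface reflectance (stand-in)
sensitivity = np.random.rand(3, wl.size)       # R/G/B silicon sensitivities incl. NIR (stand-ins)

def camera_response(led_weights):
    """Raw RGB response under a weighted LED mix (simple Riemann sum over wavelength)."""
    illumination = led_weights @ led_spectra            # combined light-source spectrum
    radiance = illumination * reflectance               # light reflected toward the sensor
    return (sensitivity * radiance).sum(axis=1) * 10.0  # integrate over 10 nm bins -> (3,)

print(camera_response(np.array([1.0, 0.5, 0.0, 0.0])))
```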
{"title":"Optimal LED Spectral Multiplexing for NIR2RGB Translation","authors":"Lei Liu, Yuze Chen, Junchi Yan, Yinqiang Zheng","doi":"10.1109/CVPR52688.2022.01232","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01232","url":null,"abstract":"The industry practice for night video surveillance is to use auxiliary near-infrared (NIR) LEDs, usually centered at 850nm or 940nm, for scene illumination. NIR LEDs are used to save power consumption while hiding the surveillance coverage area from naked human eyes. The captured images are almost monochromatic, and visual color and texture tend to disappear, which hinders human and machine perception. A few existing studies have tried to convert such NIR images to RGB images through deep learning, which can not provide satisfying results, nor generalize well beyond the training dataset. In this paper, we aim to break the fundamental restrictions on reliable NIR-to-RGB (NIR2RGB) translation by examining the imaging mechanism of single-chip silicon-based RGB cameras under NIR illuminations, and propose to retrieve the optimal LED multiplexing via deep learning. Experimental results show that this translation task can be significantly improved by properly multiplexing NIR LEDs close to the visible spectral range than using 850nm and 940nm LEDs.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133830350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}