Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.00708
Wang Zeng, Wanli Ouyang, P. Luo, Wentao Liu, Xiaogang Wang
Estimating a 3D mesh of the human body from a single 2D image is an important task with many applications, such as augmented reality and human-robot interaction. However, prior works reconstruct the 3D mesh from a global image feature extracted by a convolutional neural network (CNN), in which the dense correspondences between the mesh surface and the image pixels are missing, leading to suboptimal solutions. This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes dense correspondence between the mesh and the local image features in the UV space (i.e., a 2D space used for texture mapping of 3D meshes). DecoMR first predicts a pixel-to-surface dense correspondence map (i.e., an IUV image), with which we transfer local features from the image space to the UV space. The transferred local image features are then processed in the UV space to regress a location map, which is well aligned with the transferred features. Finally, we reconstruct the 3D human mesh from the regressed location map with a predefined mapping function. We also observe that the existing discontinuous UV map is unfriendly to network learning. Therefore, we propose a novel UV map that maintains most of the neighboring relations on the original mesh surface. Experiments demonstrate that our proposed local feature alignment and continuous UV map outperform existing 3D-mesh-based methods on multiple public benchmarks. Code will be made available at https://github.com/zengwang430521/DecoMR.
{"title":"3D Human Mesh Regression With Dense Correspondence","authors":"Wang Zeng, Wanli Ouyang, P. Luo, Wentao Liu, Xiaogang Wang","doi":"10.1109/CVPR42600.2020.00708","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00708","url":null,"abstract":"Estimating 3D mesh of the human body from a single 2D image is an important task with many applications such as augmented reality and Human-Robot interaction. However, prior works reconstructed 3D mesh from global image feature extracted by using convolutional neural network (CNN), where the dense correspondences between the mesh surface and the image pixels are missing, leading to suboptimal solution. This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes the dense correspondence between the mesh and the local image features in the UV space (i.e. a 2D space used for texture mapping of 3D mesh). DecoMR first predicts pixel-to-surface dense correspondence map (i.e., IUV image), with which we transfer local features from the image space to the UV space. Then the transferred local image features are processed in the UV space to regress a location map, which is well aligned with transferred features. Finally we reconstruct 3D human mesh from the regressed location map with a predefined mapping function. We also observe that the existing discontinuous UV map are unfriendly to the learning of network. Therefore, we propose a novel UV map that maintains most of the neighboring relations on the original mesh surface. Experiments demonstrate that our proposed local feature alignment and continuous UV map outperforms existing 3D mesh based methods on multiple public benchmarks. Code will be made available at https: //github.com/zengwang430521/DecoMR.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"62 1","pages":"7052-7061"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77511763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00463
Hanyu Shi, Guosheng Lin, Hao Wang, Tzu-Yi Hung, Zhenhua Wang
Point clouds are useful in many applications, such as autonomous driving and robotics, as they provide natural 3D information about the surrounding environment. While there is extensive research on 3D point clouds, scene understanding on 4D point clouds, i.e., series of consecutive 3D point cloud frames, is an emerging and still under-investigated topic. With 4D point clouds (3D point cloud videos), robotic systems could enhance their robustness by leveraging temporal information from previous frames. However, existing semantic segmentation methods on 4D point clouds suffer from low precision due to the spatial and temporal information loss in their network structures. In this paper, we propose SpSequenceNet to address this problem. The network is based on 3D sparse convolution, and we introduce two novel modules, a cross-frame global attention module and a cross-frame local interpolation module, to capture spatial and temporal information in 4D point clouds. We conduct extensive experiments on SemanticKITTI and achieve a state-of-the-art mIoU of 43.1%, which is 1.5% higher than the previous best approach.
{"title":"SpSequenceNet: Semantic Segmentation Network on 4D Point Clouds","authors":"Hanyu Shi, Guosheng Lin, Hao Wang, Tzu-Yi Hung, Zhenhua Wang","doi":"10.1109/cvpr42600.2020.00463","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00463","url":null,"abstract":"Point clouds are useful in many applications like autonomous driving and robotics as they provide natural 3D information of the surrounding environments. While there are extensive research on 3D point clouds, scene understanding on 4D point clouds, a series of consecutive 3D point clouds frames, is an emerging topic and yet under-investigated. With 4D point clouds (3D point cloud videos), robotic systems could enhance their robustness by leveraging the temporal information from previous frames. However, the existing semantic segmentation methods on 4D point clouds suffer from low precision due to the spatial and temporal information loss in their network structures. In this paper, we propose SpSequenceNet to address this problem. The network is designed based on 3D sparse convolution. And we introduce two novel modules, a cross-frame global attention module and a cross-frame local interpolation module, to capture spatial and temporal information in 4D point clouds. We conduct extensive experiments on SemanticKITTI, and achieve the state-of-the-art result of 43.1% on mIoU, which is 1.5% higher than the previous best approach.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"37 1","pages":"4573-4582"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79819596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00455
Zhenyu Zhang, Stéphane Lathuilière, E. Ricci, N. Sebe, Yan Yan, Jian Yang
Online depth learning is the problem of continually adapting a depth estimation model to a continuously changing environment. This problem is challenging because the network easily overfits to the current environment and forgets its past experiences. To address this problem, this paper presents a novel Learning to Prevent Forgetting (LPF) method for online monocular depth adaptation to new target domains in an unsupervised manner. Instead of updating the universal parameters, LPF learns adapter modules to efficiently adjust the feature representation and distribution without losing the pre-learned knowledge in the online setting. Specifically, to adapt to temporally continuous depth patterns in videos, we introduce a novel meta-learning approach that incorporates the online adaptation process into the learning objective. To further avoid overfitting, we propose a novel temporally consistent regularization to harmonize the gradient descent procedure at each online learning step. Extensive evaluations on real-world datasets demonstrate that the proposed method, with very limited parameters, significantly improves estimation quality.
{"title":"Online Depth Learning Against Forgetting in Monocular Videos","authors":"Zhenyu Zhang, Stéphane Lathuilière, E. Ricci, N. Sebe, Yan Yan, Jian Yang","doi":"10.1109/cvpr42600.2020.00455","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00455","url":null,"abstract":"Online depth learning is the problem of consistently adapting a depth estimation model to handle a continuously changing environment. This problem is challenging due to the network easily overfits on the current environment and forgets its past experiences. To address such problem, this paper presents a novel Learning to Prevent Forgetting (LPF) method for online mono-depth adaptation to new target domains in unsupervised manner. Instead of updating the universal parameters, LPF learns adapter modules to efficiently adjust the feature representation and distribution without losing the pre-learned knowledge in online condition. Specifically, to adapt temporal-continuous depth patterns in videos, we introduce a novel meta-learning approach to learn adapter modules by combining online adaptation process into the learning objective. To further avoid overfitting, we propose a novel temporal-consistent regularization to harmonize the gradient descent procedure at each online learning step. Extensive evaluations on real-world datasets demonstrate that the proposed method, with very limited parameters, significantly improves the estimation quality.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"4493-4502"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80092190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.01235
Xi Zhang, Xiaolin Wu, Xinliang Zhai, Xianye Ben, Chengjie Tu
Close-up talking heads are among the most common and salient objects in video content, such as face-to-face conversations in social media, teleconferences, news broadcasts, and talk shows. Due to the high sensitivity of the human visual system to faces, compression distortions in talking-head videos are highly visible and annoying. To address this problem, we present a novel deep convolutional neural network (DCNN) method for very low bit-rate video reconstruction of talking heads. The key innovation is a new DCNN architecture that exploits audio-video correlations to repair compression defects in the face region. We further improve reconstruction quality by embedding encoder information from the video compression standard into our DCNN and introducing a constraining projection module into the network. Extensive experiments demonstrate that the proposed DCNN method outperforms existing state-of-the-art methods on videos of talking heads.
{"title":"DAVD-Net: Deep Audio-Aided Video Decompression of Talking Heads","authors":"Xi Zhang, Xiaolin Wu, Xinliang Zhai, Xianye Ben, Chengjie Tu","doi":"10.1109/CVPR42600.2020.01235","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01235","url":null,"abstract":"Close-up talking heads are among the most common and salient object in video contents, such as face-to-face conversations in social media, teleconferences, news broadcasting, talk shows, etc. Due to the high sensitivity of human visual system to faces, compression distortions in talking heads videos are highly visible and annoying. To address this problem, we present a novel deep convolutional neural network (DCNN) method for very low bit rate video reconstruction of talking heads. The key innovation is a new DCNN architecture that can exploit the audio-video correlations to repair compression defects in the face region. We further improve reconstruction quality by embedding into our DCNN the encoder information of the video compression standards and introducing a constraining projection module in the network. Extensive experiments demonstrate that the proposed DCNN method outperforms the existing state-of-the-art methods on videos of talking heads.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"18 1","pages":"12332-12341"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82436270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00829
Fukun Yin, Shizhe Zhou
Non-contact measurement of human body height can be very difficult under some circumstances. In this paper, we address the problem of accurately estimating the height of a person with an arbitrary posture from a single depth image. By introducing a novel part-based intermediate representation together with a four-stage, increasingly complex deep neural network, we achieve significantly higher accuracy than previous methods. We first describe the human body by segmenting the torso into four nearly rigid parts and then predict their lengths with three CNNs. Instead of directly adding the lengths of these parts together, we construct another independent developing CNN that combines the intermediate representation, part lengths, and depth information to predict the final body height. We develop an increasingly complex network architecture and adopt hybrid pooling to optimize the training process. To the best of our knowledge, this is the first method that estimates height from a single depth image only. In experiments, our average accuracy reaches 99.1% for people in various positions and postures.
{"title":"Accurate Estimation of Body Height From a Single Depth Image via a Four-Stage Developing Network","authors":"Fukun Yin, Shizhe Zhou","doi":"10.1109/cvpr42600.2020.00829","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00829","url":null,"abstract":"Non-contact measurement of human body height can be very difficult under some circumstances.In this paper we address the problem of accurately estimating the height of a person with arbitrary postures from a single depth image. By introducing a novel part-based intermediate representation plus a four-stage increasingly complex deep neural network, we manage to achieve significantly higher accuracy than previous methods. We first describe the human body in the form of a segmentation of human torso as four nearly rigid parts and then predict their lengths respectively by 3 CNNs. Instead of directly adding the lengths of these parts together, we further construct another independent developing CNN that combines the intermediate representation, part lengths and depth information together to finally predict the body height results.Here we develop an increasingly complex network architecture and adopt a hybrid pooling to optimize training process. To the best of our knowledge, this is the first method that estimates height only from a single depth image. In experiments our average accuracy reaches at 99.1% for people in various positions and postures.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"29 1","pages":"8264-8273"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81394264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00589
Ziqian Bai, Zhaopeng Cui, Jamal Ahmed Rahim, Xiaoming Liu, P. Tan
We present a method for 3D face reconstruction from multi-view images with different expressions. We formulate this problem from the perspective of non-rigid multi-view stereo (NRMVS). Unlike previous learning-based methods, which often regress the face shape directly, our method optimizes the 3D face shape by explicitly enforcing multi-view appearance consistency, which is known from conventional multi-view stereo methods to be effective in recovering shape details. Furthermore, by estimating the face shape through optimization based on multi-view consistency, our method can potentially generalize better to unseen data. However, this optimization is challenging since each input image has a different expression. We facilitate it with a CNN that learns to regularize the non-rigid 3D face according to the input image and preliminary optimization results. Extensive experiments show that our method achieves state-of-the-art performance on various datasets and generalizes well to in-the-wild data.
{"title":"Deep Facial Non-Rigid Multi-View Stereo","authors":"Ziqian Bai, Zhaopeng Cui, Jamal Ahmed Rahim, Xiaoming Liu, P. Tan","doi":"10.1109/cvpr42600.2020.00589","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00589","url":null,"abstract":"We present a method for 3D face reconstruction from multi-view images with different expressions. We formulate this problem from the perspective of non-rigid multi-view stereo (NRMVS). Unlike previous learning-based methods, which often regress the face shape directly, our method optimizes the 3D face shape by explicitly enforcing multi-view appearance consistency, which is known to be effective in recovering shape details according to conventional multi-view stereo methods. Furthermore, by estimating face shape through optimization based on multi-view consistency, our method can potentially have better generalization to unseen data. However, this optimization is challenging since each input image has a different expression. We facilitate it with a CNN network that learns to regularize the non-rigid 3D face according to the input image and preliminary optimization results. Extensive experiments show that our method achieves the state-of-the-art performance on various datasets and generalizes well to in-the-wild data.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"65 1","pages":"5849-5859"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78682524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00443
Yifan Yang, Guorong Li, Zhe Wu, Li Su, Qingming Huang, N. Sebe
One of the critical challenges of object counting is the dramatic scale variation introduced by arbitrary perspectives. We propose a reverse perspective network that resolves the scale variations of the input images, instead of generating perspective maps to smooth the final outputs. The reverse perspective network explicitly estimates the perspective distortion and efficiently corrects it by uniformly warping the input images. The proposed network then delivers images with similar instance scales to the regressor, so the regression network does not need multi-scale receptive fields to match various scales. Moreover, to further address the scale problem in more congested areas, we enhance the corresponding regions of the ground truth with the evaluation errors and force the regressor to learn from the augmented ground truth via an adversarial process. Furthermore, to verify the proposed model, we collected a vehicle counting dataset based on unmanned aerial vehicles (UAVs), which exhibits severe scale variations. Extensive experimental results on four benchmark datasets show the improvements of our method over the state of the art.
{"title":"Reverse Perspective Network for Perspective-Aware Object Counting","authors":"Yifan Yang, Guorong Li, Zhe Wu, Li Su, Qingming Huang, N. Sebe","doi":"10.1109/cvpr42600.2020.00443","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00443","url":null,"abstract":"One of the critical challenges of object counting is the dramatic scale variations, which is introduced by arbitrary perspectives. We propose a reverse perspective network to solve the scale variations of input images, instead of generating perspective maps to smooth final outputs. The reverse perspective network explicitly evaluates the perspective distortions, and efficiently corrects the distortions by uniformly warping the input images. Then the proposed network delivers images with similar instance scales to the regressor. Thus the regression network doesn't need multi-scale receptive fields to match the various scales. Besides, to further solve the scale problem of more congested areas, we enhance the corresponding regions of ground-truth with the evaluation errors. Then we force the regressor to learn from the augmented ground-truth via an adversarial process. Furthermore, to verify the proposed model, we collected a vehicle counting dataset based on Unmanned Aerial Vehicles (UAVs). The proposed dataset has fierce scale variations. Extensive experimental results on four benchmark datasets show the improvements of our method against the state-of-the-arts.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"96 1","pages":"4373-4382"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78733276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00456
Lingjing Wang, Xiang Li, Yi Fang
Recently, deep neural networks have been introduced as supervised discriminative models for learning 3D point cloud segmentation. Most previous supervised methods require a large amount of training data with human-annotated part labels to guide the training process and ensure the model's generalization ability on test data. In contrast, we propose a novel 3D shape segmentation method that requires few labeled examples for training. Given an input 3D shape, training starts by identifying a similar 3D shape with part annotations from a mini-pool of shape templates (e.g., 10 shapes). With the selected template shape, a novel Coherent Point Transformer is proposed to fully leverage the power of a deep neural network to smoothly morph the template shape towards the input shape. Then, based on the transformed template shapes with part labels, a newly proposed Part-specific Density Estimator learns a continuous part-specific probability distribution function over the entire 3D space with a batch consistency regularization term. With the learned part-specific probability distribution, our model predicts the part labels of a new input 3D shape in an end-to-end manner. We demonstrate that our proposed method achieves remarkable segmentation results on the ShapeNet dataset with few shots, compared to previous supervised learning approaches.
{"title":"Few-Shot Learning of Part-Specific Probability Space for 3D Shape Segmentation","authors":"Lingjing Wang, Xiang Li, Yi Fang","doi":"10.1109/cvpr42600.2020.00456","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00456","url":null,"abstract":"Recently, deep neural networks are introduced as supervised discriminative models for the learning of 3D point cloud segmentation. Most previous supervised methods require a large number of training data with human annotation part labels to guide the training process to ensure the model's generalization abilities on test data. In comparison, we propose a novel 3D shape segmentation method that requires few labeled data for training. Given an input 3D shape, the training of our model starts with identifying a similar 3D shape with part annotations from a mini-pool of shape templates (e.g. 10 shapes). With the selected template shape, a novel Coherent Point Transformer is proposed to fully leverage the power of a deep neural network to smoothly morph the template shape towards the input shape. Then, based on the transformed template shapes with part labels, a newly proposed Part-specific Density Estimator is developed to learn a continuous part-specific probability distribution function on the entire 3D space with a batch consistency regularization term. With the learned part-specific probability distribution, our model is able to predict the part labels of a new input 3D shape in an end-to-end manner. We demonstrate that our proposed method can achieve remarkable segmentation results on the ShapeNet dataset with few shots, compared to previous supervised learning approaches.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"110 1","pages":"4503-4512"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87930548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00816
Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, Chunxia Xiao
Generating virtual object shadows consistent with real-world environment shading effects is important but challenging in computer vision and augmented reality applications. To address this problem, we propose an end-to-end generative adversarial network for shadow generation, named ARShadowGAN, for augmented reality in single-light scenes. ARShadowGAN makes full use of the attention mechanism and directly models the mapping between the virtual object shadow and the real-world environment without any explicit estimation of illumination or 3D geometric information. In addition, we collect an image set that provides rich clues for shadow generation and construct a dataset for training and evaluating ARShadowGAN. Extensive experimental results show that ARShadowGAN is capable of directly generating plausible virtual object shadows in single-light scenes. Our source code is available at https://github.com/ldq9526/ARShadowGAN.
"ARShadowGAN: Shadow Generative Adversarial Network for Augmented Reality in Single Light Scenes," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8136-8145.
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00305
Ming Jiang, Shi Chen, Jinhui Yang, Qi Zhao
While most visual attention studies focus on bottom-up attention with a restricted field of view, real-life situations are filled with embodied vision tasks. The role of attention is more significant in the latter due to information overload, and attention to the most important regions is critical to task success. The effects of visual attention on task performance in this context have also been widely ignored. This research addresses a number of challenges to bridge this gap, on both the data and the model side. Specifically, we introduce the first dataset of top-down attention in immersive scenes. The Immersive Question-directed Visual Attention (IQVA) dataset features visual attention and corresponding task performance (i.e., answer correctness). It consists of 975 questions and answers collected from people viewing 360° videos in a head-mounted display. Analyses of the data demonstrate a significant correlation between people's task performance and their eye movements, suggesting the role of attention in task performance. Building on this, a neural network is developed to encode the differences between correct and incorrect attention and to jointly predict the two. The proposed attention model takes answer correctness into account for the first time, and its outputs naturally distinguish important regions from distractions. This study, with new data and features, may enable new tasks that leverage attention and answer correctness, and inspire research that reveals the decision-making process behind performing various tasks.
{"title":"Fantastic Answers and Where to Find Them: Immersive Question-Directed Visual Attention","authors":"Ming Jiang, Shi Chen, Jinhui Yang, Qi Zhao","doi":"10.1109/cvpr42600.2020.00305","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00305","url":null,"abstract":"While most visual attention studies focus on bottom-up attention with restricted field-of-view, real-life situations are filled with embodied vision tasks. The role of attention is more significant in the latter due to the information overload, and attention to the most important regions is critical to the success of tasks. The effects of visual attention on task performance in this context have also been widely ignored. This research addresses a number of challenges to bridge this research gap, on both the data and model aspects. Specifically, we introduce the first dataset of top-down attention in immersive scenes. The Immersive Question-directed Visual Attention (IQVA) dataset features visual attention and corresponding task performance (i.e., answer correctness). It consists of 975 questions and answers collected from people viewing 360° videos in a head-mounted display. Analyses of the data demonstrate a significant correlation between people's task performance and their eye movements, suggesting the role of attention in task performance. With that, a neural network is developed to encode the differences of correct and incorrect attention and jointly predict the two. The proposed attention model for the first time takes into account answer correctness, whose outputs naturally distinguish important regions from distractions. This study with new data and features may enable new tasks that leverage attention and answer correctness, and inspire new research that reveals the process behind decision making in performing various tasks.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"179 1","pages":"2977-2986"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86225766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}