Pub Date: 2020-06-01  DOI: 10.1109/CVPR42600.2020.00367
Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, G. Qiu
Previous deep learning approaches to color constancy usually estimate the illuminant value directly from the input image. Such approaches can be highly sensitive to variation in image content. To overcome this problem, we introduce a deep metric learning approach to color constancy named the Illuminant-Guided Triplet Network (IGTN). IGTN generates an Illuminant Consistent and Discriminative Feature (ICDF) for robust and accurate illuminant color estimation. The ICDF is composed of semantic and color features based on a learnable color histogram scheme. In the ICDF space, regardless of the similarity of their content, images taken under the same or similar illuminants are placed close to each other, while images taken under different illuminants are placed far apart. We also adopt an end-to-end training strategy that simultaneously groups image features and estimates the illuminant value, so our approach does not have to classify illuminants in a separate module. We evaluate our method on two public datasets and demonstrate that it outperforms state-of-the-art approaches. Furthermore, we show that our method is less sensitive to image appearance and achieves more robust and consistent results than other methods on a High Dynamic Range dataset.
{"title":"End-to-End Illuminant Estimation Based on Deep Metric Learning","authors":"Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, G. Qiu","doi":"10.1109/CVPR42600.2020.00367","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00367","url":null,"abstract":"Previous deep learning approaches to color constancy usually directly estimate illuminant value from input image. Such approaches might suffer heavily from being sensitive to the variation of image content. To overcome this problem, we introduce a deep metric learning approach named Illuminant-Guided Triplet Network (IGTN) to color constancy. IGTN generates an Illuminant Consistent and Discriminative Feature (ICDF) for achieving robust and accurate illuminant color estimation. ICDF is composed of semantic and color features based on a learnable color histogram scheme. In the ICDF space, regardless of the similarities of their contents, images taken under the same or similar illuminants are placed close to each other and at the same time images taken under different illuminants are placed far apart. We also adopt an end-to-end training strategy to simultaneously group image features and estimate illuminant value, and thus our approach does not have to classify illuminant in a separate module. We evaluate our method on two public datasets and demonstrate our method outperforms state-of-the-art approaches. Furthermore, we demonstrate that our method is less sensitive to image appearances, and can achieve more robust and consistent results than other methods on a High Dynamic Range dataset.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"3613-3622"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88739409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/CVPR42600.2020.00554
Wu Shi, Y. Qiao
Texture synthesis using deep neural networks can generate high-quality and diversified textures. However, it usually requires a heavy optimization process. Subsequent works accelerate the process by using feed-forward networks, but at the cost of scalability, diversity, or quality. We propose a new efficient method that aims to simulate the optimization process while retaining most of its properties. Our method takes a noise image and the gradients from a descriptor network as inputs, and synthesizes a refined image with respect to the target image. The proposed method can synthesize images with better quality and diversity than other fast synthesis methods. Moreover, when trained on a large-scale dataset, our method generalizes to synthesize unseen textures.
{"title":"Fast Texture Synthesis via Pseudo Optimizer","authors":"Wu Shi, Y. Qiao","doi":"10.1109/CVPR42600.2020.00554","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00554","url":null,"abstract":"Texture synthesis using deep neural networks can generate high quality and diversified textures. However, it usually requires a heavy optimization process. The following works accelerate the process by using feed-forward networks, but at the cost of scalability. diversity or quality. We propose a new efficient method that aims to simulate the optimization process while retains most of the properties. Our method takes a noise image and the gradients from a descriptor network as inputs, and synthesize a refined image with respect to the target image. The proposed method can synthesize images with better quality and diversity than the other fast synthesis methods do. Moreover, our method trained on a large scale dataset can generalize to synthesize unseen textures.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"25 1","pages":"5497-5506"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88788862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00498
Alejandro Fontan, Javier Civera, Rudolph Triebel
This paper presents an information-theoretic approach to point selection in direct RGB-D odometry. The aim is to select only the most informative measurements, in order to shrink the optimization problem with minimal impact on accuracy. It is common practice in visual odometry/SLAM to track several hundred points, achieving real-time performance on high-end desktop PCs. Reducing their computational footprint will facilitate the implementation of odometry and SLAM on low-end platforms such as small robots and AR/VR glasses. Our experimental results show that our novel information-based selection criterion allows us to reduce the number of tracked points by an order of magnitude (down to only 24), achieving accuracy similar to the state of the art (sometimes outperforming it) while reducing the computational demand tenfold.
{"title":"Information-Driven Direct RGB-D Odometry","authors":"Alejandro Fontan, Javier Civera, Rudolph Triebel","doi":"10.1109/cvpr42600.2020.00498","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00498","url":null,"abstract":"This paper presents an information-theoretic approach to point selection in direct RGB-D odometry. The aim is to select only the most informative measurements, in order to reduce the optimization problem with a minimal impact in the accuracy. It is usual practice in visual odometry/SLAM to track several hundreds of points, achieving real-time performance in high-end desktop PCs. Reducing their computational footprint will facilitate the implementation of odometry and SLAM in low-end platforms such as small robots and AR/VR glasses. Our experimental results show that our novel information-based selection criterion allows us to reduce the number of tracked points an order of magnitude (down to only 24 of them), achieving an accuracy similar to the state of the art (sometimes outperforming it) while reducing 10 times the computational demand.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"34 1","pages":"4928-4936"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80634732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/CVPR42600.2020.01050
Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between the visual and linguistic modalities, but usually fail to exploit the informative words of the expression to align features from the two modalities for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, relational words are used to highlight the correct entity and suppress irrelevant ones via multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels under the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance. Code is available at https://github.com/spyflying/CMPC-Refseg.
{"title":"Referring Image Segmentation via Cross-Modal Progressive Comprehension","authors":"Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li","doi":"10.1109/CVPR42600.2020.01050","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01050","url":null,"abstract":"Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to well align features from the two modalities for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones by multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information. In this way, features from multi-levels could communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performances. Code is available at https://github.com/spyflying/CMPC-Refseg.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 2 1","pages":"10485-10494"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86958202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00344
Hailiang Xu, Siqi Xie, FangFu Chen
Maximally Stable Extremal Regions (MSER) algorithms are based on the component tree and are used to detect invariant regions. OpenCV MSER, the most popular MSER implementation, uses a linked list to associate pixels with ERs. The data structure of an ER contains the attributes of a head and a tail linked node, which makes OpenCV MSER hard to parallelize using existing parallel component tree strategies. Moreover, pixel extraction (i.e., extracting the pixels in MSERs) in OpenCV MSER is very slow. In this paper, we propose two novel MSER algorithms, called Fast MSER V1 and V2. They first divide an image into several spatial partitions, then construct sub-trees and doubly linked lists (for V1) or a labelled image (for V2) on the partitions in parallel. V1 merges the sub-trees into the final tree with a novel sub-tree merging algorithm, merging the doubly linked lists in the process, whereas V2 merges the sub-trees using an existing merging algorithm. Finally, MSERs are recognized and their pixels are extracted through two novel pixel extraction methods that exploit the fact that many pixels in parent and child MSERs are duplicated. Both V1 and V2 outperform three open-source MSER algorithms (28 and 26 times faster than OpenCV MSER, respectively), and reduce the memory required for the pixels in MSERs by 78%.
{"title":"Fast MSER","authors":"Hailiang Xu, Siqi Xie, FangFu Chen","doi":"10.1109/cvpr42600.2020.00344","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00344","url":null,"abstract":"Maximally Stable Extremal Regions (MSER) algorithms are based on the component tree and are used to detect invariant regions. OpenCV MSER, the most popular MSER implementation, uses a linked list to associate pixels with ERs. The data-structure of an ER contains the attributes of a head and a tail linked node, which makes OpenCV MSER hard to be performed in parallel using existing parallel component tree strategies. Besides, pixel extraction (i.e. extracting the pixels in MSERs) in OpenCV MSER is very slow. In this paper, we propose two novel MSER algorithms, called Fast MSER V1 and V2. They first divide an image into several spatial partitions, then construct sub-trees and doubly linked lists (for V1) or a labelled image (for V2) on the partitions in parallel. A novel sub-tree merging algorithm is used in V1 to merge the sub-trees into the final tree, and the doubly linked lists are also merged in the process. While V2 merges the sub-trees using an existing merging algorithm. Finally, MSERs are recognized, the pixels in them are extracted through two novel pixel extraction methods taking advantage of the fact that a lot of pixels in parent and child MSERs are duplicated. Both V1 and V2 outperform three open source MSER algorithms (28 and 26 times faster than OpenCV MSER), and reduce the memory of the pixels in MSERs by 78%.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"8 1 1","pages":"3377-3386"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83641777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00523
A. Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, Sharon Alpert
This paper presents a new image-based virtual try-on approach (Outfit-VITON) that helps visualize how a composition of clothing items selected from various reference images forms a cohesive outfit on a person in a query image. Our algorithm has two distinctive properties. First, it is inexpensive, as it simply requires a large set of single (non-corresponding) images (both real and catalog) of people wearing various garments, without explicit 3D information. The training phase requires only single images, eliminating the need for manually created image pairs in which one image shows a person wearing a particular garment and the other shows the same catalog garment alone. Second, it can synthesize images of multiple garments composed into a single, coherent outfit, and it enables control of the type of garments rendered in the final outfit. Once trained, our approach can synthesize a cohesive outfit from multiple images of clothed human models, while fitting the outfit to the body shape and pose of the query person. An online optimization step takes care of fine details such as intricate textures and logos. Quantitative and qualitative evaluations on an image dataset containing large shape and style variations demonstrate superior accuracy compared to existing state-of-the-art methods, especially when dealing with highly detailed garments.
{"title":"Image Based Virtual Try-On Network From Unpaired Data","authors":"A. Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, Sharon Alpert","doi":"10.1109/cvpr42600.2020.00523","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00523","url":null,"abstract":"This paper presents a new image-based virtual try-on approach (Outfit-VITON) that helps visualize how a composition of clothing items selected from various reference images form a cohesive outfit on a person in a query image. Our algorithm has two distinctive properties. First, it is inexpensive, as it simply requires a large set of single (non-corresponding) images (both real and catalog) of people wearing various garments without explicit 3D information. The training phase requires only single images, eliminating the need for manually creating image pairs, where one image shows a person wearing a particular garment and the other shows the same catalog garment alone. Secondly, it can synthesize images of multiple garments composed into a single, coherent outfit; and it enables control of the type of garments rendered in the final outfit. Once trained, our approach can then synthesize a cohesive outfit from multiple images of clothed human models, while fitting the outfit to the body shape and pose of the query person. An online optimization step takes care of fine details such as intricate textures and logos. Quantitative and qualitative evaluations on an image dataset containing large shape and style variations demonstrate superior accuracy compared to existing state-of-the-art methods, especially when dealing with highly detailed garments.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"5 1","pages":"5183-5192"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89974278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00488
Lyujian Lu, Hua Wang, Saad Elbeleidy, F. Nie
With rapid progress in high-throughput genotyping and neuroimaging, research on complex brain disorders such as Alzheimer’s Disease (AD) has gained significant attention in recent years. Many prediction models have been studied to relate neuroimaging measures to cognitive status over the course of disease progression. Missing data is one of the biggest challenges in accurately predicting the cognitive scores of subjects in longitudinal neuroimaging studies. To tackle this problem, in this paper we propose a novel formulation to learn an enriched representation for imaging biomarkers that simultaneously captures the information conveyed by the baseline neuroimaging records and that conveyed by the progressive variation of the varying number of available follow-up records over time. Although the number of brain scans varies across participants, the learned biomarker representation for every participant is a fixed-length vector, which enables us to use traditional learning models to study AD development. Our new objective is formulated to maximize the ratio of the summations of a number of L1-norm distances for improved robustness, which is difficult to solve efficiently in general. We therefore derive a new efficient iterative solution algorithm and rigorously prove its convergence. We have performed extensive experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. A performance gain is achieved in predicting four different cognitive scores when the learned, enriched representations are compared against the original baseline representations. These promising empirical results demonstrate the improved performance of our new method and validate its effectiveness.
{"title":"Predicting Cognitive Declines Using Longitudinally Enriched Representations for Imaging Biomarkers","authors":"Lyujian Lu, Hua Wang, Saad Elbeleidy, F. Nie","doi":"10.1109/cvpr42600.2020.00488","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00488","url":null,"abstract":"With rapid progress in high-throughput genotyping and neuroimaging, researches of complex brain disorders, such as Alzheimer’s Disease (AD), have gained significant attention in recent years. Many prediction models have been studied to relate neuroimaging measures to cognitive status over the progressions when these disease develops. Missing data is one of the biggest challenge in accurate cognitive score prediction of subjects in longitudinal neuroimaging studies. To tackle this problem, in this paper we propose a novel formulation to learn an enriched representation for imaging biomarkers that can simultaneously capture both the information conveyed by baseline neuroimaging records and that by progressive variations of varied counts of available follow-up records over time. While the numbers of the brain scans of the participants vary, the learned biomarker representation for every participant is a fixed-length vector, which enable us to use traditional learning models to study AD developments. Our new objective is formulated to maximize the ratio of the summations of a number of L1-norm distances for improved robustness, which, though, is difficult to efficiently solve in general. Thus we derive a new efficient iterative solution algorithm and rigorously prove its convergence. We have performed extensive experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. A performance gain has been achieved to predict four different cognitive scores, when we compare the original baseline representations against the learned representations with enrichments. These promising empirical results have demonstrated improved performances of our new method that validate its effectiveness.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"24 1","pages":"4826-4835"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90801949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00622
Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, W. Freeman, R. Sukthankar, C. Sminchisescu
We present a statistical, articulated 3D human shape modeling pipeline within a fully trainable, modular, deep learning framework. Given high-resolution complete 3D body scans of humans captured in various poses, together with additional closeups of their head and facial expressions as well as hand articulation, and given initial, artist-designed, gender-neutral rigged quad-meshes, we train all model parameters, including non-linear shape spaces based on variational auto-encoders, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, in a single consistent learning loop. The models are trained simultaneously with all the 3D dynamic scan data (over 60,000 diverse human configurations in our new dataset) in order to capture correlations and ensure consistency of the various components. The models support facial expression analysis as well as body (with detailed hand) shape and pose estimation. We provide fully trainable generic human models at different resolutions: the moderate-resolution GHUM with 10,168 vertices and the low-resolution GHUML(ite) with 3,194 vertices. We run comparisons between them, analyze the impact of different components, and illustrate their reconstruction from image data. The models will be available for research.
{"title":"GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models","authors":"Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, W. Freeman, R. Sukthankar, C. Sminchisescu","doi":"10.1109/cvpr42600.2020.00622","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00622","url":null,"abstract":"We present a statistical, articulated 3D human shape modeling pipeline, within a fully trainable, modular, deep learning framework. Given high-resolution complete 3D body scans of humans, captured in various poses, together with additional closeups of their head and facial expressions, as well as hand articulation, and given initial, artist designed, gender neutral rigged quad-meshes, we train all model parameters including non-linear shape spaces based on variational auto-encoders, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, in a single consistent learning loop. The models are simultaneously trained with all the 3d dynamic scan data (over 60,000 diverse human configurations in our new dataset) in order to capture correlations and ensure consistency of various components. Models support facial expression analysis, as well as body (with detailed hand) shape and pose estimation. We provide fully train-able generic human models of different resolutions- the moderate-resolution GHUM consisting of 10,168 vertices and the low-resolution GHUML(ite) of 3,194 vertices–, run comparisons between them, analyze the impact of different components and illustrate their reconstruction from image data. The models will be available for research.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"88 1","pages":"6183-6192"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89766141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00724
Mihai Fieraru, M. Zanfir, Elisabeta Oneata, A. Popa, Vlad Olaru, C. Sminchisescu
Understanding 3D human interactions is fundamental for fine-grained scene analysis and behavioural modeling. However, most existing models focus on analyzing a single person in isolation, and those that process several people focus largely on resolving multi-person data association rather than inferring interactions. This may lead to incorrect, lifeless 3D estimates that miss the subtle human contact aspects (the essence of the event) and are of little use for detailed behavioral understanding. This paper addresses such issues and makes several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3D contact signature prediction; (2) we show how such components can be leveraged to produce augmented losses that ensure contact consistency during 3D reconstruction; (3) we construct several large datasets for learning and evaluating 3D contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3D motion capture dataset with 631 sequences containing 2,525 contact events and 728,664 ground-truth 3D poses, as well as FlickrCI3D, a dataset of 11,216 images with 14,081 processed pairs of people and 81,233 facet-level surface correspondences within 138,213 selected contact regions. Finally, (4) we present models and baselines to illustrate how contact estimation supports meaningful 3D reconstruction where essential interactions are captured. Models and data are made available for research purposes at http://vision.imar.ro/ci3d.
{"title":"Three-Dimensional Reconstruction of Human Interactions","authors":"Mihai Fieraru, M. Zanfir, Elisabeta Oneata, A. Popa, Vlad Olaru, C. Sminchisescu","doi":"10.1109/cvpr42600.2020.00724","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00724","url":null,"abstract":"Understanding 3d human interactions is fundamental for fine grained scene analysis and behavioural modeling. However, most of the existing models focus on analyzing a single person in isolation, and those who process several people focus largely on resolving multi-person data association, rather than inferring interactions. This may lead to incorrect, lifeless 3d estimates, that miss the subtle human contact aspects--the essence of the event--and are of little use for detailed behavioral understanding. This paper addresses such issues and makes several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3d contact signature prediction; (2) we show how such components can be leveraged in order to produce augmented losses that ensure contact consistency during 3d reconstruction; (3) we construct several large datasets for learning and evaluating 3d contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3d motion capture dataset with 631 sequences containing 2,525 contact events, 728,664 ground truth 3d poses, as well as FlickrCI3D, a dataset of 11,216 images, with 14,081 processed pairs of people, and 81,233 facet-level surface correspondences within 138,213 selected contact regions. Finally, (4) we present models and baselines to illustrate how contact estimation supports meaningful 3d reconstruction where essential interactions are captured. Models and data are made available for research purposes at http://vision.imar.ro/ci3d.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"50 1","pages":"7212-7221"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90423395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-06-01  DOI: 10.1109/cvpr42600.2020.00072
Maosen Li, Cheng Deng, Tengjiao Li, Junchi Yan, Xinbo Gao, Heng Huang
An intriguing property of adversarial examples is their transferability, which suggests that black-box attacks are feasible in real-world applications. Previous works mostly study transferability in the non-targeted setting. However, recent studies show that targeted adversarial examples are more difficult to transfer than non-targeted ones. In this paper, we identify two defects that make transferable targeted examples difficult to generate. First, the gradient magnitude decreases during the iterative attack, causing excessive consistency between successive noise terms in the momentum accumulation, which we term noise curing. Second, it is not enough for targeted adversarial examples to merely get close to the target class without also moving away from the true class. To overcome these problems, we propose a novel targeted attack approach that effectively generates more transferable adversarial examples. Specifically, we first introduce the Poincaré distance as the similarity metric to make the gradient magnitude self-adaptive during the iterative attack and thus alleviate noise curing. Furthermore, we regularize the targeted attack process with metric learning to push adversarial examples away from the true label and obtain more transferable targeted adversarial examples. Experiments on ImageNet validate the superiority of our approach, which achieves an attack success rate 8% higher on average than other state-of-the-art methods in black-box targeted attacks.
{"title":"Towards Transferable Targeted Attack","authors":"Maosen Li, Cheng Deng, Tengjiao Li, Junchi Yan, Xinbo Gao, Heng Huang","doi":"10.1109/cvpr42600.2020.00072","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00072","url":null,"abstract":"An intriguing property of adversarial examples is their transferability, which suggests that black-box attacks are feasible in real-world applications. Previous works mostly study the transferability on non-targeted setting. However, recent studies show that targeted adversarial examples are more difficult to transfer than non-targeted ones. In this paper, we find there exist two defects that lead to the difficulty in generating transferable examples. First, the magnitude of gradient is decreasing during iterative attack, causing excessive consistency between two successive noises in accumulation of momentum, which is termed as noise curing. Second, it is not enough for targeted adversarial examples to just get close to target class without moving away from true class. To overcome the above problems, we propose a novel targeted attack approach to effectively generate more transferable adversarial examples. Specifically, we first introduce the Poincar'{e} distance as the similarity metric to make the magnitude of gradient self-adaptive during iterative attack to alleviate noise curing. Furthermore, we regularize the targeted attack process with metric learning to take adversarial examples away from true label and gain more transferable targeted adversarial examples. Experiments on ImageNet validate the superiority of our approach achieving 8% higher attack success rate over other state-of-the-art methods on average in black-box targeted attack.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"6 1","pages":"638-646"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73506622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}