Erasing Integrated Learning: A Simple Yet Effective Approach for Weakly Supervised Object Localization
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00879 | Pages: 8763-8772
Jinjie Mai, Meng Yang, Wenfeng Luo
Weakly supervised object localization (WSOL) aims to localize objects with only weak supervision such as image-level labels. However, a long-standing problem for existing techniques based on classification networks is that they tend to highlight only the most discriminative parts rather than the entire extent of the object. Conversely, trying to cover the integral extent of the object can degrade image classification performance. To remedy this, we propose a simple yet powerful approach built on a novel adversarial erasing technique, erasing integrated learning (EIL). By integrating discriminative region mining and adversarial erasing into a single forward-backward propagation in a vanilla CNN, the proposed EIL explores the high-response class-specific area and the less discriminative region simultaneously, and thus maintains high classification performance while jointly discovering the full extent of the object. Furthermore, we apply multiple EIL (MEIL) modules at different levels of the network in a sequential manner, which for the first time integrates semantic features of multiple levels and multiple scales through adversarial erasing learning. The proposed EIL and the more advanced MEIL both achieve a new state of the art on the CUB-200-2011 and ILSVRC 2016 benchmarks, yielding significant improvements in localization while maintaining high image classification performance.
{"title":"Erasing Integrated Learning: A Simple Yet Effective Approach for Weakly Supervised Object Localization","authors":"Jinjie Mai, Meng Yang, Wenfeng Luo","doi":"10.1109/cvpr42600.2020.00879","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00879","url":null,"abstract":"Weakly supervised object localization (WSOL) aims to localize object with only weak supervision like image-level labels. However, a long-standing problem for available techniques based on the classification network is that they often result in highlighting the most discriminative parts rather than the entire extent of object. Nevertheless, trying to explore the integral extent of the object could degrade the performance of image classification on the contrary. To remedy this, we propose a simple yet powerful approach by introducing a novel adversarial erasing technique, erasing integrated learning (EIL). By integrating discriminative region mining and adversarial erasing in a single forward-backward propagation in a vanilla CNN, the proposed EIL explores the high response class-specific area and the less discriminative region simultaneously, thus could maintain high performance in classification and jointly discover the full extent of the object. Furthermore, we apply multiple EIL (MEIL) modules at different levels of the network in a sequential manner, which for the first time integrates semantic features of multiple levels and multiple scales through adversarial erasing learning. In particular, the proposed EIL and advanced MEIL both achieve a new state-of-the-art performance in CUB-200-2011 and ILSVRC 2016 benchmark, making significant improvement in localization while advancing high performance in image classification.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"7 1 1","pages":"8763-8772"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83619856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00187 | Pages: 1797-1806
Zhi-Hao Lin, S. Huang, Y. Wang
Point clouds are among the most popular geometry representations for 3D vision applications. However, lacking the regular structure of 2D images, processing and summarizing information over these unordered data points is very challenging. Although a number of previous works attempt to analyze point clouds and achieve promising performance, their performance degrades significantly when data variations such as shifts and scale changes are present. In this paper, we propose 3D Graph Convolution Networks (3D-GCN), designed to extract local 3D features from point clouds across scales while providing shift and scale invariance. The novelty of 3D-GCN lies in the definition of learnable kernels together with a graph max-pooling mechanism. We show that 3D-GCN can be applied to 3D classification and segmentation tasks, with ablation studies and visualizations verifying its design.
{"title":"Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis","authors":"Zhi-Hao Lin, S. Huang, Y. Wang","doi":"10.1109/cvpr42600.2020.00187","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00187","url":null,"abstract":"Point clouds are among the popular geometry representations for 3D vision applications. However, without regular structures like 2D images, processing and summarizing information over these unordered data points are very challenging. Although a number of previous works attempt to analyze point clouds and achieve promising performances, their performances would degrade significantly when data variations like shift and scale changes are presented. In this paper, we propose 3D Graph Convolution Networks (3D-GCN), which is designed to extract local 3D features from point clouds across scales, while shift and scale-invariance properties are introduced. The novelty of our 3D-GCN lies in the definition of learnable kernels with a graph max-pooling mechanism. We show that 3D-GCN can be applied to 3D classification and segmentation tasks, with ablation studies and visualizations verifying the design of 3D-GCN.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"43 1","pages":"1797-1806"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76155480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-Robust 3D Point Recognition via Gather-Vector Guidance
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.01153 | Pages: 11513-11521
Xiaoyi Dong, Dongdong Chen, Hang Zhou, G. Hua, Weiming Zhang, Nenghai Yu
In this paper, we look into the problem of 3D adversarial attacks and propose to leverage the internal properties of point clouds and adversarial examples to design a new self-robust deep neural network (DNN) based 3D recognition system. On the one hand, point clouds are highly structured. Hence, for each local part of a clean point cloud, it is possible to learn what it is (``part of a bottle") and its relative position (``upper part of a bottle") with respect to the global object center. On the other hand, under the visual quality constraint, 3D adversarial samples often introduce only small local perturbations; they roughly preserve the original global center but may cause incorrect estimates of local relative position. Motivated by these two properties, we use the relative position (dubbed the ``gather-vector") as the adversarial indicator and propose a new robust gather module. Equipped with this module, we further propose a new self-robust 3D point recognition network. Through extensive experiments, we demonstrate that the proposed method significantly improves robustness against targeted attacks in the white-box setting. For the I-FGSM-based attack, our method reduces the attack success rate from 94.37% to 75.69%. For the C&W-based attack, our method reduces the attack success rate by more than 40.00%. Moreover, our method is complementary to other types of defense methods, achieving better defense results when combined.
{"title":"Self-Robust 3D Point Recognition via Gather-Vector Guidance","authors":"Xiaoyi Dong, Dongdong Chen, Hang Zhou, G. Hua, Weiming Zhang, Nenghai Yu","doi":"10.1109/cvpr42600.2020.01153","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01153","url":null,"abstract":"In this paper, we look into the problem of 3D adversary attack, and propose to leverage the internal properties of the point clouds and the adversarial examples to design a new self-robust deep neural network (DNN) based 3D recognition systems. As a matter of fact, on one hand, point clouds are highly structured. Hence for each local part of clean point clouds, it is possible to learn what is it (``part of a bottle\") and its relative position (``upper part of a bottle\") to the global object center. On the other hand, with the visual quality constraint, 3D adversarial samples often only produce small local perturbations, thus they will roughly keep the original global center but may cause incorrect local relative position estimation. Motivated by these two properties, we use relative position (dubbed as ``gather-vector\") as the adversarial indicator and propose a new robust gather module. Equipped with this module, we further propose a new self-robust 3D point recognition network. Through extensive experiments, we demonstrate that the proposed method can improve the robustness of the target attack under the white-box setting significantly. For I-FGSM based attack, our method reduces the attack success rate from 94.37 % to 75.69 %. For C&W based attack, our method reduces the attack success rate more than 40.00 %. Moreover, our method is complementary to other types of defense methods to achieve better defense results.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"118 2 1","pages":"11513-11521"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76183257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.01344 | Pages: 13427-13436
Jialian Wu, Chunluan Zhou, Ming Yang, Qian Zhang, Yuan Li, Junsong Yuan
State-of-the-art pedestrian detectors perform promisingly on non-occluded pedestrians, yet they are still challenged by heavy occlusions. Although many previous works have attempted to alleviate the pedestrian occlusion issue, most of them rely on still images. In this paper, we exploit the local temporal context of pedestrians in videos and propose a tube feature aggregation network (TFAN) aimed at strengthening pedestrian detectors against severe occlusions. Specifically, for an occluded pedestrian in the current frame, we iteratively search for its relevant counterparts along the temporal axis to form a tube. Features from the tube are then aggregated with adaptive weights to enhance the feature representation of the occluded pedestrian. Furthermore, we devise a temporally discriminative embedding module (TDEM) and a part-based relation module (PRM), which adapt our approach to better handle tube drifting and heavy occlusions, respectively. Extensive experiments on three datasets, Caltech, NightOwls and KAIST, show that the proposed method is highly effective for heavily occluded pedestrian detection. Moreover, we achieve state-of-the-art performance on the Caltech and NightOwls datasets.
{"title":"Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians","authors":"Jialian Wu, Chunluan Zhou, Ming Yang, Qian Zhang, Yuan Li, Junsong Yuan","doi":"10.1109/cvpr42600.2020.01344","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01344","url":null,"abstract":"State-of-the-art pedestrian detectors have performed promisingly on non-occluded pedestrians, yet they are still confronted by heavy occlusions. Although many previous works have attempted to alleviate the pedestrian occlusion issue, most of them rest on still images. In this paper, we exploit the local temporal context of pedestrians in videos and propose a tube feature aggregation network (TFAN) aiming at enhancing pedestrian detectors against severe occlusions. Specifically, for an occluded pedestrian in the current frame, we iteratively search for its relevant counterparts along temporal axis to form a tube. Then, features from the tube are aggregated according to an adaptive weight to enhance the feature representations of the occluded pedestrian. Furthermore, we devise a temporally discriminative embedding module (TDEM) and a part-based relation module (PRM), respectively, which adapts our approach to better handle tube drifting and heavy occlusions. Extensive experiments are conducted on three datasets, Caltech, NightOwls and KAIST, showing that our proposed method is significantly effective for heavily occluded pedestrian detection. Moreover, we achieve the state-of-the-art performance on the Caltech and NightOwls datasets.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"30 1","pages":"13427-13436"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77111475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SharinGAN: Combining Synthetic and Real Data for Unsupervised Geometry Estimation
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.01399 | Pages: 13971-13980
K. Pnvr, Hao Zhou, D. Jacobs
We propose a novel method for combining synthetic and real images when training networks to determine geometric information from a single image. We suggest mapping both image types into a single, shared domain, which is connected to a primary network for end-to-end training. Ideally, this results in images from the two domains presenting shared information to the primary network. Our experiments demonstrate significant improvements over the state of the art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting.
{"title":"SharinGAN: Combining Synthetic and Real Data for Unsupervised Geometry Estimation","authors":"K. Pnvr, Hao Zhou, D. Jacobs","doi":"10.1109/CVPR42600.2020.01399","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01399","url":null,"abstract":"We propose a novel method for combining synthetic and real images when training networks to determine geometric information from a single image. We suggest a method for mapping both image types into a single, shared domain. This is connected to a primary network for end-to-end training. Ideally, this results in images from two domains that present shared information to the primary network. Our experiments demonstrate significant improvements over the state-of-the-art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"178 1","pages":"13971-13980"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76466513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optical Flow in the Dark
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00678 | Pages: 6748-6756
Yinqiang Zheng, Mingfang Zhang, Feng Lu
Many successful optical flow estimation methods have been proposed, but they fail when tested on dark scenes, because low-light scenarios were not considered in their design and current optical flow benchmark datasets lack low-light samples. Even preprocessing the dark images with an enhancement method, which yields good visual quality, still leads to poor or even worse optical flow results, because information such as motion consistency may be destroyed during enhancement. We propose an end-to-end data-driven method that avoids this error accumulation and learns optical flow directly from low-light noisy images. Specifically, we develop a method to synthesize large-scale low-light optical flow datasets by simulating the noise model on dark raw images. We also collect a new optical flow dataset in raw format with a wide range of exposures to be used as a benchmark. Models trained on our synthetic dataset largely maintain optical flow accuracy as image brightness decreases and greatly outperform existing methods on low-light images.
{"title":"Optical Flow in the Dark","authors":"Yinqiang Zheng, Mingfang Zhang, Feng Lu","doi":"10.1109/cvpr42600.2020.00678","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00678","url":null,"abstract":"Many successful optical flow estimation methods have been proposed, but they become invalid when tested in dark scenes because low-light scenarios are not considered when they are designed and current optical flow benchmark datasets lack low-light samples. Even if we preprocess to enhance the dark images, which achieves great visual perception, it still leads to poor optical flow results or even worse ones, because information like motion consistency may be broken while enhancing. We propose an end-to-end data-driven method that avoids error accumulation and learns optical flow directly from low-light noisy images. Specifically, we develop a method to synthesize large-scale low-light optical flow datasets by simulating the noise model on dark raw images. We also collect a new optical flow dataset in raw format with a large range of exposure to be used as a benchmark. The models trained on our synthetic dataset can relatively maintain optical flow accuracy as the image brightness descends and they outperform the existing methods greatly on low-light images.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"106 1","pages":"6748-6756"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79557004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.00807 | Pages: 8046-8056
Yerlan Idelbayev, M. A. Carreira-Perpiñán
Neural net compression can be achieved by approximating each layer's weight matrix by a low-rank matrix. The real difficulty in doing this is not in training the resulting neural net (made up of one low-rank matrix per layer), but in determining what the optimal rank of each layer is—effectively, an architecture search problem with one hyperparameter per layer. We show that, with a suitable formulation, this problem is amenable to a mixed discrete-continuous optimization jointly over the ranks and over the matrix elements, and give a corresponding algorithm. We show that this indeed can select ranks much better than existing approaches, making low-rank compression much more attractive than previously thought. For example, we can make a VGG network faster than a ResNet and with nearly the same classification error.
{"title":"Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer","authors":"Yerlan Idelbayev, M. A. Carreira-Perpiñán","doi":"10.1109/cvpr42600.2020.00807","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00807","url":null,"abstract":"Neural net compression can be achieved by approximating each layer's weight matrix by a low-rank matrix. The real difficulty in doing this is not in training the resulting neural net (made up of one low-rank matrix per layer), but in determining what the optimal rank of each layer is—effectively, an architecture search problem with one hyperparameter per layer. We show that, with a suitable formulation, this problem is amenable to a mixed discrete-continuous optimization jointly over the ranks and over the matrix elements, and give a corresponding algorithm. We show that this indeed can select ranks much better than existing approaches, making low-rank compression much more attractive than previously thought. For example, we can make a VGG network faster than a ResNet and with nearly the same classification error.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"85 1","pages":"8046-8056"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81086220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Height and Uprightness Invariance for 3D Prediction From a Single View
Pub Date: 2020-06-01 | DOI: 10.1109/CVPR42600.2020.00057 | Pages: 488-497
Manel Baradad, A. Torralba
Current state-of-the-art methods that predict 3D from single images ignore the fact that the height of objects and their upright orientation are invariant to the camera pose and intrinsic parameters. To account for this, we propose a system that directly regresses 3D world coordinates for each pixel. First, our system predicts the camera position with respect to the ground plane and its intrinsic parameters. It then predicts the 3D position of each pixel along the rays spanned by the camera. The predicted 3D coordinates and normals are invariant to changes in the camera position or its model, and we can directly impose a regression loss on these world coordinates. Our approach yields competitive results for depth and camera pose estimation (while not being explicitly trained to predict either) and improves cross-dataset generalization over existing state-of-the-art methods.
{"title":"Height and Uprightness Invariance for 3D Prediction From a Single View","authors":"Manel Baradad, A. Torralba","doi":"10.1109/CVPR42600.2020.00057","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00057","url":null,"abstract":"Current state-of-the-art methods that predict 3D from single images ignore the fact that the height of objects and their upright orientation is invariant to the camera pose and intrinsic parameters. To account for this, we propose a system that directly regresses 3D world coordinates for each pixel. First, our system predicts the camera position with respect to the ground plane and its intrinsic parameters. Followed by that, it predicts the 3D position for each pixel along the rays spanned by the camera. The predicted 3D coordinates and normals are invariant to a change in the camera position or its model, and we can directly impose a regression loss on these world coordinates. Our approach yields competitive results for depth and camera pose estimation (while not being explicitly trained to predict any of these) and improves across-dataset generalization performance over existing state-of-the-art methods.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"488-497"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83911854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual-Semantic Matching by Exploring High-Order Attention and Distraction
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.01280 | Pages: 12783-12792
Yongzhi Li, Duo Zhang, Yadong Mu
Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address the task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplets, object-attribute pairs) and visual distraction (i.e., despite being highly relevant to a textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and textual modalities. Our technical contributions are twofold: first, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from the two scene graphs, and a novel cross-graph attention mechanism is proposed to contextually reweight graph elements and compute the inter-graph similarity. Second, some top-ranked samples are in fact false matches due to the co-occurrence of highly relevant and distracting information; we devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of modeling both high-order attention and distraction.
{"title":"Visual-Semantic Matching by Exploring High-Order Attention and Distraction","authors":"Yongzhi Li, Duo Zhang, Yadong Mu","doi":"10.1109/cvpr42600.2020.01280","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01280","url":null,"abstract":"Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously-ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite the high relevance to textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both visual and textual modalities. Our technical contributions are two-folds: firstly, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweigh graph elements and calculate the inter-graph similarity; Secondly, some top-ranked samples are indeed false matching due to the co-occurrence of both highly-relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"12783-12792"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88215546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-End Camera Calibration for Broadcast Videos
Pub Date: 2020-06-01 | DOI: 10.1109/cvpr42600.2020.01364 | Pages: 13624-13633
Long Sha, Jennifer Hobbs, Panna Felsen, Xinyu Wei, P. Lucey, Sujoy Ganguly
The increasing number of vision-based tracking systems deployed in production has necessitated fast, robust camera calibration. In the domain of sports, most current work focuses on sports where lines and intersections are easy to extract and appearance is relatively consistent across venues. However, for more challenging sports like basketball, those techniques are not sufficient. In this paper, we propose an end-to-end approach for single moving camera calibration across challenging scenarios in sports. Our method contains three key modules: 1) area-based court segmentation, 2) camera pose estimation with embedded templates, and 3) homography prediction via a spatial transform network (STN). All three modules are connected, enabling end-to-end training. We evaluate our method on a new college basketball dataset and demonstrate state-of-the-art performance in variable and dynamic environments. We also validate our method on the World Cup 2014 dataset, showing competitive performance against state-of-the-art methods. Lastly, we show that our method is two orders of magnitude faster than the previous state of the art on both datasets.
{"title":"End-to-End Camera Calibration for Broadcast Videos","authors":"Long Sha, Jennifer Hobbs, Panna Felsen, Xinyu Wei, P. Lucey, Sujoy Ganguly","doi":"10.1109/cvpr42600.2020.01364","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01364","url":null,"abstract":"The increasing number of vision-based tracking systems deployed in production have necessitated fast, robust camera calibration. In the domain of sport, the majority of current work focuses on sports where lines and intersections are easy to extract, and appearance is relatively consistent across venues. However, for more challenging sports like basketball, those techniques are not sufficient. In this paper, we propose an end-to-end approach for single moving camera calibration across challenging scenarios in sports. Our method contains three key modules: 1) area-based court segmentation, 2) camera pose estimation with embedded templates, 3) homography prediction via a spatial transform network (STN). All three modules are connected, enabling end-to-end training. We evaluate our method on a new college basketball dataset and demonstrate state of the art performance in variable and dynamic environments. We also validate our method on the World Cup 2014 dataset to show its competitive performance against the state-of-the-art methods. Lastly, we show that our method is two orders of magnitude faster than the previous state of the art on both datasets.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"81 1","pages":"13624-13633"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88445050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}