Philippe Weinzaepfel, G. Csurka, Yohann Cabon, M. Humenberger
We introduce a novel CNN-based approach for visual localization from a single RGB image that relies on densely matching a set of Objects-of-Interest (OOIs). In this paper, we focus on planar objects that are highly descriptive in an environment, such as paintings in museums or logos and storefronts in malls or airports. For each OOI, we define a reference image for which 3D world coordinates are available. Given a query image, our CNN model detects the OOIs, segments them, and finds a dense set of 2D-2D matches between each detected OOI and its corresponding reference image. Given these 2D-2D matches, together with the 3D world coordinates of each reference image, we obtain a set of 2D-3D matches from which solving a Perspective-n-Point problem gives a pose estimate. We show that 2D-3D matches for the reference images, as well as OOI annotations for all training images, can be obtained from a single instance annotation per OOI by leveraging Structure-from-Motion reconstruction. We introduce a novel synthetic dataset, VirtualGallery, which targets challenges such as varying lighting conditions and different occlusion levels. Our results show that our method achieves high precision and is robust to these challenges. We also experiment with the Baidu localization dataset captured in a shopping mall. Our approach is the first deep regression-based method to scale to such a large environment.
{"title":"Visual Localization by Learning Objects-Of-Interest Dense Match Regression","authors":"Philippe Weinzaepfel, G. Csurka, Yohann Cabon, M. Humenberger","doi":"10.1109/CVPR.2019.00578","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00578","url":null,"abstract":"We introduce a novel CNN-based approach for visual localization from a single RGB image that relies on densely matching a set of Objects-of-Interest (OOIs). In this paper, we focus on planar objects which are highly descriptive in an environment, such as paintings in museums or logos and storefronts in malls or airports. For each OOI, we define a reference image for which 3D world coordinates are available. Given a query image, our CNN model detects the OOIs, segments them and finds a dense set of 2D-2D matches between each detected OOI and its corresponding reference image. Given these 2D-2D matches, together with the 3D world coordinates of each reference image, we obtain a set of 2D-3D matches from which solving a Perspective-n-Point problem gives a pose estimate. We show that 2D-3D matches for reference images, as well as OOI annotations can be obtained for all training images from a single instance annotation per OOI by leveraging Structure-from-Motion reconstruction. We introduce a novel synthetic dataset, VirtualGallery, which targets challenges such as varying lighting conditions and different occlusion levels. Our results show that our method achieves high precision and is robust to these challenges. We also experiment using the Baidu localization dataset captured in a shopping mall. Our approach is the first deep regression-based method to scale to such a larger environment.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"5627-5636"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80314914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian Ding, Nan Xue, Yang Long, Guisong Xia, Qikai Lu
Object detection in aerial images is an active yet challenging task in computer vision because of the bird's-eye view perspective, the highly complex backgrounds, and the varied appearances of objects. Especially when detecting densely packed objects in aerial images, methods that rely on horizontal proposals for common object detection often introduce mismatches between the Regions of Interest (RoIs) and the objects. This leads to a common misalignment between the final object classification confidence and the localization accuracy. In this paper, we propose a RoI Transformer to address these problems. The core idea of the RoI Transformer is to apply spatial transformations to RoIs and learn the transformation parameters under the supervision of oriented bounding box (OBB) annotations. The RoI Transformer is lightweight and can be easily embedded into detectors for oriented object detection. Simply applying the RoI Transformer to Light-Head R-CNN achieves state-of-the-art performance on two common and challenging aerial datasets, i.e., DOTA and HRSC2016, with a negligible reduction in detection speed. Our RoI Transformer exceeds deformable Position-Sensitive RoI pooling when oriented bounding-box annotations are available. Extensive experiments have also validated the flexibility and effectiveness of our RoI Transformer.
{"title":"Learning RoI Transformer for Oriented Object Detection in Aerial Images","authors":"Jian Ding, Nan Xue, Yang Long, Guisong Xia, Qikai Lu","doi":"10.1109/CVPR.2019.00296","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00296","url":null,"abstract":"Object detection in aerial images is an active yet challenging task in computer vision because of the bird’s-eye view perspective, the highly complex backgrounds, and the variant appearances of objects. Especially when detecting densely packed objects in aerial images, methods relying on horizontal proposals for common object detection often introduce mismatches between the Region of Interests (RoIs) and objects. This leads to the common misalignment between the final object classification confidence and localization accuracy. In this paper, we propose a RoI Transformer to address these problems. The core idea of RoI Transformer is to apply spatial transformations on RoIs and learn the transformation parameters under the supervision of oriented bounding box (OBB) annotations. RoI Transformer is with lightweight and can be easily embedded into detectors for oriented object detection. Simply apply the RoI Transformer to light head RCNN has achieved state-of-the-art performances on two common and challenging aerial datasets, i.e., DOTA and HRSC2016, with a neglectable reduction to detection speed. Our RoI Transformer exceeds the deformable Position Sensitive RoI pooling when oriented bounding-box annotations are available. Extensive experiments have also validated the flexibility and effectiveness of our RoI Transformer.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"18 1","pages":"2844-2853"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83039855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanyuan Zhao, Xue-mei Hu, Hui Guo, Zhan Ma, Tao Yue, Xun Cao
Developing high-light-efficiency imaging techniques to retrieve high-dimensional optical signals is a long-term goal in computational photography. Multispectral imaging, which captures images at different wavelengths and boosts the ability to reveal scene properties, has developed rapidly in the last few decades. From scanning methods to snapshot imaging, the limit of light-collection efficiency keeps being pushed, which enables wider applications, especially in light-starved scenes. In this work, we propose a novel multispectral imaging technique that captures multispectral images with high light efficiency. By investigating the dispersive blur caused by spectral dispersers and introducing difference-of-blur (DoB) constraints, we propose a basic theory for capturing multispectral information from a single dispersive-blurred image and an additional spectrum of an arbitrary point in the scene. Based on this theory, we design a prototype system and develop an optimization algorithm to realize snapshot multispectral imaging. The effectiveness of the proposed method is verified on both synthetic data and real captured images.
{"title":"Spectral Reconstruction From Dispersive Blur: A Novel Light Efficient Spectral Imager","authors":"Yuanyuan Zhao, Xue-mei Hu, Hui Guo, Zhan Ma, Tao Yue, Xun Cao","doi":"10.1109/CVPR.2019.01248","DOIUrl":"https://doi.org/10.1109/CVPR.2019.01248","url":null,"abstract":"Developing high light efficiency imaging techniques to retrieve high dimensional optical signal is a long-term goal in computational photography. Multispectral imaging, which captures images of different wavelengths and boosting the abilities for revealing scene properties, has developed rapidly in the last few decades. From scanning method to snapshot imaging, the limit of light collection efficiency is kept being pushed which enables wider applications especially under the light-starved scenes. In this work, we propose a novel multispectral imaging technique, that could capture the multispectral images with a high light efficiency. Through investigating the dispersive blur caused by spectral dispersers and introducing the difference of blur (DoB) constraints, we propose a basic theory for capturing multispectral information from a single dispersive-blurred image and an additional spectrum of an arbitrary point in the scene. Based on the theory, we design a prototype system and develop an optimization algorithm to realize snapshot multispectral imaging. The effectiveness of the proposed method is verified on both the synthetic data and real captured images.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"76 1","pages":"12194-12203"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89520518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei
Discovering social relations, e.g., kinship or friendship, from visual content can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important medium: video. On one hand, the actions and storylines in videos provide important additional cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, and may not even appear together in a single frame from beginning to end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can capture both long-term and short-term storylines in videos. In this way, MSTR can comprehensively explore multi-scale actions and storylines in the spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.
{"title":"Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning","authors":"Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei","doi":"10.1109/CVPR.2019.00368","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00368","url":null,"abstract":"Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important media--video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, even not in one same image from beginning to the end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"104 1","pages":"3561-3569"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87486283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Neven, Bert De Brabandere, M. Proesmans, L. Gool
Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed, low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work, we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real time while maintaining high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (a 5% improvement over Mask R-CNN) at more than 10 fps on 2MP images.
{"title":"Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth","authors":"D. Neven, Bert De Brabandere, M. Proesmans, L. Gool","doi":"10.1109/CVPR.2019.00904","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00904","url":null,"abstract":"Current state-of-the-art instance segmentation methods are not suited for real-time applications like autonomous driving, which require fast execution times at high accuracy. Although the currently dominant proposal-based methods have high accuracy, they are slow and generate masks at a fixed and low resolution. Proposal-free methods, by contrast, can generate masks at high resolution and are often faster, but fail to reach the same accuracy as the proposal-based methods. In this work we propose a new clustering loss function for proposal-free instance segmentation. The loss function pulls the spatial embeddings of pixels belonging to the same instance together and jointly learns an instance-specific clustering bandwidth, maximizing the intersection-over-union of the resulting instance mask. When combined with a fast architecture, the network can perform instance segmentation in real-time while maintaining a high accuracy. We evaluate our method on the challenging Cityscapes benchmark and achieve top results (5% improvement over Mask R-CNN) at more than 10 fps on 2MP images.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"8829-8837"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87766096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. S. Hosseini, Lyndon Chan, Gabriel Tse, M. Tang, J. Deng, Sajad Norouzi, C. Rowsell, K. Plataniotis, S. Damaskinos
In recent years, computer vision techniques have made large advances in image recognition and have been applied to aid radiological diagnosis. Computational pathology aims to develop similar tools for aiding pathologists in diagnosing digitized histopathological slides, which would improve diagnostic accuracy and productivity amidst increasing workloads. However, there is a lack of publicly available databases of (1) localized patch-level images annotated with (2) a large range of Histological Tissue Types (HTTs). As a result, computational pathology research is constrained to diagnosing specific diseases or classifying tissues from specific organs, and cannot be readily generalized to handle unexpected diseases and organs. In this paper, we propose a new digital pathology database, the "Atlas of Digital Pathology" (ADP), which comprises 17,668 patch images extracted from 100 slides annotated with up to 57 hierarchical HTTs. Our data generalizes across tissue types and organs and aims to provide training data for supervised multi-label learning of patch-level HTTs in digitized whole-slide images. We demonstrate the quality of our image labels through pathologist consultation and by training three state-of-the-art neural networks on tissue type classification. Quantitative results support the visual consistency of our data, and we demonstrate a tissue-type-based visual attention aid as a sample tool that could be developed from our database.
{"title":"Atlas of Digital Pathology: A Generalized Hierarchical Histological Tissue Type-Annotated Database for Deep Learning","authors":"M. S. Hosseini, Lyndon Chan, Gabriel Tse, M. Tang, J. Deng, Sajad Norouzi, C. Rowsell, K. Plataniotis, S. Damaskinos","doi":"10.1109/CVPR.2019.01202","DOIUrl":"https://doi.org/10.1109/CVPR.2019.01202","url":null,"abstract":"In recent years, computer vision techniques have made large advances in image recognition and been applied to aid radiological diagnosis. Computational pathology aims to develop similar tools for aiding pathologists in diagnosing digitized histopathological slides, which would improve diagnostic accuracy and productivity amidst increasing workloads. However, there is a lack of publicly-available databases of (1) localized patch-level images annotated with (2) a large range of Histological Tissue Type (HTT). As a result, computational pathology research is constrained to diagnosing specific diseases or classifying tissues from specific organs, and cannot be readily generalized to handle unexpected diseases and organs. In this paper, we propose a new digital pathology database, the ``Atlas of Digital Pathology'' (or ADP), which comprises of 17,668 patch images extracted from 100 slides annotated with up to 57 hierarchical HTTs. Our data is generalized to different tissue types across different organs and aims to provide training data for supervised multi-label learning of patch-level HTT in a digitized whole slide image. We demonstrate the quality of our image labels through pathologist consultation and by training three state-of-the-art neural networks on tissue type classification. Quantitative results support the visually consistency of our data and we demonstrate a tissue type-based visual attention aid as a sample tool that could be developed from our database.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"31 1","pages":"11739-11748"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89642913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chih-Hui Ho, Pedro Morgado, Amir Persekian, N. Vasconcelos
The role of pose invariance in image recognition and retrieval is studied. A taxonomic classification of embeddings, according to their level of invariance, is introduced and used to clarify connections between existing embeddings, identify missing approaches, and propose invariant generalizations. This leads to a new family of pose invariant embeddings (PIEs), derived from existing approaches by a combination of two models, which follow from the interpretation of CNNs as estimators of class posterior probabilities: a view-to-object model and an object-to-class model. The new pose-invariant models are shown to have interesting properties, both theoretically and through experiments, where they outperform existing multiview approaches. Most notably, they achieve good performance for both 1) classification and retrieval, and 2) single and multiview inference. These are important properties for the design of real vision systems, where universal embeddings are preferable to task-specific ones, and multiple images are usually not available at inference time. Finally, a new multiview dataset of real objects, imaged in the wild against complex backgrounds, is introduced. We believe that this is a much-needed complement to the synthetic datasets in wide use and will contribute to the advancement of multiview recognition and retrieval.
{"title":"PIEs: Pose Invariant Embeddings","authors":"Chih-Hui Ho, Pedro Morgado, Amir Persekian, N. Vasconcelos","doi":"10.1109/CVPR.2019.01266","DOIUrl":"https://doi.org/10.1109/CVPR.2019.01266","url":null,"abstract":"The role of pose invariance in image recognition and retrieval is studied. A taxonomic classification of embeddings, according to their level of invariance, is introduced and used to clarify connections between existing embeddings, identify missing approaches, and propose invariant generalizations. This leads to a new family of pose invariant embeddings (PIEs), derived from existing approaches by a combination of two models, which follow from the interpretation of CNNs as estimators of class posterior probabilities: a view-to-object model and an object-to-class model. The new pose-invariant models are shown to have interesting properties, both theoretically and through experiments, where they outperform existing multiview approaches. Most notably, they achieve good performance for both 1) classification and retrieval, and 2) single and multiview inference. These are important properties for the design of real vision systems, where universal embeddings are preferable to task specific ones, and multiple images are usually not available at inference time. Finally, a new multiview dataset of real objects, imaged in the wild against complex backgrounds, is introduced. We believe that this is a much needed complement to the synthetic datasets in wide use and will contribute to the advancement of multiview recognition and retrieval.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"18 1","pages":"12369-12378"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89903885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongming Rao, Jiwen Lu, Jie Zhou
We present a generic, flexible, and 3D-rotation-invariant framework based on spherical symmetry for point cloud recognition. By introducing a regular icosahedral lattice and its fractals to approximate and discretize the sphere, convolution can be easily implemented to process 3D points. Based on this fractal structure, a hierarchical feature learning framework, together with an adaptive sphere projection module, is proposed to learn deep features in an end-to-end manner. Our framework not only inherits the strong representation power and generalization capability of convolutional neural networks for image recognition, but also extends CNNs to learn robust features resistant to rotations and perturbations. The proposed model is effective yet robust. A comprehensive experimental study demonstrates that our approach achieves competitive performance compared to state-of-the-art techniques on both 3D object classification and part segmentation tasks, while outperforming other rotation-invariant models on rotated 3D object classification and retrieval tasks by a large margin.
{"title":"Spherical Fractal Convolutional Neural Networks for Point Cloud Recognition","authors":"Yongming Rao, Jiwen Lu, Jie Zhou","doi":"10.1109/CVPR.2019.00054","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00054","url":null,"abstract":"We present a generic, flexible and 3D rotation invariant framework based on spherical symmetry for point cloud recognition. By introducing regular icosahedral lattice and its fractals to approximate and discretize sphere, convolution can be easily implemented to process 3D points. Based on the fractal structure, a hierarchical feature learning framework together with an adaptive sphere projection module is proposed to learn deep feature in an end-to-end manner. Our framework not only inherits the strong representation power and generalization capability from convolutional neural networks for image recognition, but also extends CNN to learn robust feature resistant to rotations and perturbations. The proposed model is effective yet robust. Comprehensive experimental study demonstrates that our approach can achieve competitive performance compared to state-of-the-art techniques on both 3D object classification and part segmentation tasks, meanwhile, outperform other rotation invariant models on rotated 3D object classification and retrieval tasks by a large margin.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"65 1","pages":"452-460"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84029254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Bai, Peng Tang, Philip H. S. Torr, Longin Jan Latecki
This work studies the unsupervised re-ranking procedure for object retrieval and person re-identification, with a specific focus on ensembles of multiple metrics (or similarities). While the re-ranking step involves running a diffusion process on the underlying data manifolds, the fusion step can leverage the complementarity of multiple metrics. We give a comprehensive summary of existing fusion-with-diffusion strategies and systematically analyze their pros and cons. Based on this analysis, we propose a unified yet robust algorithm that inherits their advantages and discards their disadvantages; hence, we call it Unified Ensemble Diffusion (UED). More interestingly, we derive that the inherited properties indeed stem from a theoretical framework in which the relevant works can be elegantly summarized as special cases of UED by imposing additional constraints on the objective function and varying the solver of the similarity propagation. Extensive experiments on 3D shape retrieval, image retrieval, and person re-identification demonstrate that the proposed framework outperforms the state of the art, and at the same time suggest that re-ranking via metric fusion is a promising tool to further improve the retrieval performance of existing algorithms.
{"title":"Re-Ranking via Metric Fusion for Object Retrieval and Person Re-Identification","authors":"S. Bai, Peng Tang, Philip H. S. Torr, Longin Jan Latecki","doi":"10.1109/CVPR.2019.00083","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00083","url":null,"abstract":"This work studies the unsupervised re-ranking procedure for object retrieval and person re-identification with a specific concentration on an ensemble of multiple metrics (or similarities). While the re-ranking step is involved by running a diffusion process on the underlying data manifolds, the fusion step can leverage the complementarity of multiple metrics. We give a comprehensive summary of existing fusion with diffusion strategies, and systematically analyze their pros and cons. Based on the analysis, we propose a unified yet robust algorithm which inherits their advantages and discards their disadvantages. Hence, we call it Unified Ensemble Diffusion (UED). More interestingly, we derive that the inherited properties indeed stem from a theoretical framework, where the relevant works can be elegantly summarized as special cases of UED by imposing additional constraints on the objective function and varying the solver of similarity propagation. Extensive experiments with 3D shape retrieval, image retrieval and person re-identification demonstrate that the proposed framework outperforms the state of the arts, and at the same time suggest that re-ranking via metric fusion is a promising tool to further improve the retrieval performance of existing algorithms.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"73 1","pages":"740-749"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86352676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin-shan Pan, Jiangxin Dong, Jimmy S. J. Ren, Liang Lin, Jinhui Tang, Ming-Hsuan Yang
Joint filtering mainly uses an additional guidance image as a prior and transfers its structures to the target image in the filtering process. Different from existing algorithms that rely on locally linear models or hand-designed objective functions to extract structural information from the guidance image, we propose a new joint filter based on a spatially variant linear representation model (SVLRM), in which the target image is linearly represented by the guidance image. However, the SVLRM leads to a highly ill-posed problem. To estimate the linear representation coefficients, we develop an effective algorithm based on a deep convolutional neural network (CNN). The proposed deep CNN (constrained by the SVLRM) estimates the spatially variant linear representation coefficients, which model the structural information of both the guidance and input images. We show that the proposed algorithm can be effectively applied to a variety of applications, including depth/RGB image upsampling and restoration, flash/no-flash image deblurring, natural image denoising, and scale-aware filtering. Extensive experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art methods that have been specially designed for each task.
{"title":"Spatially Variant Linear Representation Models for Joint Filtering","authors":"Jin-shan Pan, Jiangxin Dong, Jimmy S. J. Ren, Liang Lin, Jinhui Tang, Ming-Hsuan Yang","doi":"10.1109/CVPR.2019.00180","DOIUrl":"https://doi.org/10.1109/CVPR.2019.00180","url":null,"abstract":"Joint filtering mainly uses an additional guidance image as a prior and transfers its structures to the target image in the filtering process. Different from existing algorithms that rely on locally linear models or hand-designed objective functions to extract the structural information from the guidance image, we propose a new joint filter based on a spatially variant linear representation model (SVLRM), where the target image is linearly represented by the guidance image. However, the SVLRM leads to a highly ill-posed problem. To estimate the linear representation coefficients, we develop an effective algorithm based on a deep convolutional neural network (CNN). The proposed deep CNN (constrained by the SVLRM) is able to estimate the spatially variant linear representation coefficients which are able to model the structural information of both the guidance and input images. We show that the proposed algorithm can be effectively applied to a variety of applications, including depth/RGB image upsampling and restoration, flash/no-flash image deblurring, natural image denoising, scale-aware filtering, etc. Extensive experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art methods that have been specially designed for each task.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"8 1","pages":"1702-1711"},"PeriodicalIF":0.0,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82880234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}