Application of the Reeb Graph Technique to Vehicle Occupant's Head Detection in Low-resolution Range Images
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383450
P. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, B. Ottersten
In [3], a low-resolution range sensor was investigated for an occupant classification system that distinguishes a person from a child seat or an empty seat. The optimal deployment of vehicle airbags for maximum protection moreover requires information about the occupant's size and position. Determining the occupant's position involves detecting and localizing the occupant's head. This is a challenging problem because approaches based on local shape analysis (in 2D or 3D) alone are not robust enough: other parts of the body, such as the shoulders or knees, may have shapes similar to the head. This paper discusses and investigates the potential of a Reeb graph approach to describe the topology of vehicle occupants in terms of a skeleton. The essence of the proposed approach is that an occupant sitting in a vehicle has a typical topology which leads to distinct branches of a Reeb graph; the possible locations of the occupant's head are thus the end points of the Reeb graph. The proposed method is applied to real 3D range images and compared to ground-truth information. The results show the feasibility of using topological information to identify the position of the occupant's head.
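The abstract gives no implementation details, so the following is only a minimal Python sketch of the underlying idea, assuming the occupant is already segmented as a 3D point cloud and using height as the Morse function. The slice count, distance threshold, and head-candidate rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def connected_components(points, eps):
    """Group points via single-linkage: points closer than eps end up in the
    same component (BFS over a thresholded distance matrix)."""
    n = len(points)
    labels = -np.ones(n, dtype=int)
    adj = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1) < eps
    n_comp = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], n_comp
        while stack:
            j = stack.pop()
            for k in np.where(adj[j] & (labels < 0))[0]:
                labels[k] = n_comp
                stack.append(k)
        n_comp += 1
    return labels, n_comp

def head_candidates_from_reeb_skeleton(points, n_slices=12, eps=0.08):
    """Build a coarse Reeb-graph-like skeleton of an occupant point cloud using
    height (z) as the Morse function and return the centroids of end points
    (leaf nodes) in the upper half of the body as head-position candidates."""
    z = points[:, 2]
    edges = np.linspace(z.min(), z.max() + 1e-9, n_slices + 1)
    nodes = []  # (slice index, centroid, member indices)
    for s in range(n_slices):
        idx = np.where((z >= edges[s]) & (z < edges[s + 1]))[0]
        if len(idx) == 0:
            continue
        labels, n_comp = connected_components(points[idx], eps)
        for c in range(n_comp):
            members = idx[labels == c]
            nodes.append((s, points[members].mean(axis=0), members))
    # Connect components in adjacent slices whose member points come close;
    # node degrees then reveal the branch structure of the skeleton.
    degree = np.zeros(len(nodes), dtype=int)
    for a, (sa, _, ma) in enumerate(nodes):
        for b, (sb, _, mb) in enumerate(nodes):
            if sb != sa + 1:
                continue
            gap = np.linalg.norm(points[ma][:, None, :] - points[mb][None, :, :], axis=-1).min()
            if gap < eps:
                degree[a] += 1
                degree[b] += 1
    leaves = [cen for (s, cen, _), d in zip(nodes, degree) if d <= 1 and s >= n_slices // 2]
    return sorted(leaves, key=lambda cen: -cen[2])  # highest end points first
```

On a real range image the occupant would first have to be separated from the seat and background, which this sketch takes as given.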
{"title":"Application of the Reeb Graph Technique to Vehicle Occupant's Head Detection in Low-resolution Range Images","authors":"P. Devarakota, M. Castillo-Franco, R. Ginhoux, B. Mirbach, B. Ottersten","doi":"10.1109/CVPR.2007.383450","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383450","url":null,"abstract":"In [3], a low-resolution range sensor was investigated for an occupant classification system that distinguish person from child seats or an empty seat. The optimal deployment of vehicle airbags for maximum protection moreover requires information about the occupant's size and position. The detection of occupant's position involves the detection and localization of occupant's head. This is a challenging problem as the approaches based on local shape analysis (in 2D or 3D) alone are not robust enough as other parts of the person's body like shoulders, knee may have similar shapes as the head. This paper discusses and investigate the potential of a Reeb graph approach to describe the topology of vehicle occupants in terms of a skeleton. The essence of the proposed approach is that an occupant sitting in a vehicle has a typical topology which leads to different branches of a Reeb Graph and the possible location of the occupant's head are thus the end points of the Reeb graph. The proposed method is applied on real 3D range images and is compared to Ground truth information. Results show the feasibility of using topological information to identify the position of occupant's head.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133915491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Clustering using Multi-Resolution Perceptual Grouping
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.382986
T. Syeda-Mahmood, Fei Wang
Clustering is a common operation for data partitioning in many practical applications. Often, the underlying data distributions exhibit higher-level structures that are important for problem characterization but are not explicitly discovered by existing clustering algorithms. In this paper, we introduce multi-resolution perceptual grouping as an approach to unsupervised clustering. Specifically, we use the perceptual grouping constraints of proximity, density, contiguity and orientation similarity. We apply these constraints in a multi-resolution fashion to group sample points in high-dimensional spaces into salient clusters. We present an extensive evaluation of the clustering algorithm against state-of-the-art supervised and unsupervised clustering methods on large datasets.
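As a rough illustration of coarse-to-fine grouping under proximity and density constraints only (a sketch with assumed scales and thresholds, not the authors' algorithm, which additionally uses contiguity and orientation similarity):

```python
import numpy as np

def group_by_proximity(X, eps):
    """Single-linkage grouping: points closer than eps fall into the same group."""
    n = len(X)
    labels = -np.ones(n, dtype=int)
    adj = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) < eps
    group = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], group
        while stack:
            j = stack.pop()
            for k in np.where(adj[j] & (labels < 0))[0]:
                labels[k] = group
                stack.append(k)
        group += 1
    return labels

def multiresolution_grouping(X, scales=(1.0, 0.5, 0.25), min_density=5.0, min_points=3):
    """Coarse-to-fine grouping: at each scale, groups that are dense enough are
    accepted as salient clusters; the rest are re-grouped at the next, finer scale."""
    pending, clusters = [np.arange(len(X))], []
    for eps in scales:
        next_pending = []
        for idx in pending:
            labels = group_by_proximity(X[idx], eps)
            for g in np.unique(labels):
                members = idx[labels == g]
                extent = np.ptp(X[members], axis=0) + 1e-6
                dense = (len(members) >= min_points and
                         len(members) / np.prod(extent) >= min_density)
                (clusters if dense else next_pending).append(members)
        pending = next_pending
    return clusters + pending  # leftovers at the finest scale are kept as-is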
{"title":"Unsupervised Clustering using Multi-Resolution Perceptual Grouping","authors":"T. Syeda-Mahmood, Fei Wang","doi":"10.1109/CVPR.2007.382986","DOIUrl":"https://doi.org/10.1109/CVPR.2007.382986","url":null,"abstract":"Clustering is a common operation for data partitioning in many practical applications. Often, such data distributions exhibit higher level structures which are important for problem characterization, but are not explicitly discovered by existing clustering algorithms. In this paper, we introduce multi-resolution perceptual grouping as an approach to unsupervised clustering. Specifically, we use the perceptual grouping constraints of proximity, density, contiguity and orientation similarity. We apply these constraints in a multi-resolution fashion, to group sample points in high dimensional spaces into salient clusters. We present an extensive evaluation of the clustering algorithm against state-of-the-art supervised and unsupervised clustering methods on large datasets.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132775853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Topic-Motion Model for Unsupervised Video Object Discovery
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383220
David Liu, Tsuhan Chen
The bag-of-words representation has attracted a lot of attention recently in the field of object recognition. Based on the bag-of-words representation, topic models such as probabilistic latent semantic analysis (PLSA) have been applied to unsupervised object discovery in still images. In this paper, we extend topic models from still images to motion videos with the integration of a temporal model. We propose a novel spatial-temporal framework that uses topic models for appearance modeling, and the probabilistic data association (PDA) filter for motion modeling. The spatial and temporal models are tightly integrated so that motion ambiguities can be resolved by appearance, and appearance ambiguities can be resolved by motion. We show promising results that cannot be achieved by appearance or motion modeling alone.
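A compact sketch of the appearance side of such a framework, namely PLSA fitted by EM to a document-word count matrix; variable names, initialization, and iteration count are illustrative, and the temporal PDA component is omitted.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM to a document-word count matrix.
    counts : (n_docs, n_words) array of visual-word counts per image/frame.
    Returns p(w|z), shape (n_topics, n_words), and p(z|d), shape (n_docs, n_topics)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z | d, w) is proportional to p(z|d) * p(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (n_docs, n_topics, n_words)
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from expected counts.
        expected = counts[:, None, :] * p_z_dw
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```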
{"title":"A Topic-Motion Model for Unsupervised Video Object Discovery","authors":"David Liu, Tsuhan Chen","doi":"10.1109/CVPR.2007.383220","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383220","url":null,"abstract":"The bag-of-words representation has attracted a lot of attention recently in the field of object recognition. Based on the bag-of-words representation, topic models such as probabilistic latent semantic analysis (PLSA) have been applied to unsupervised object discovery in still images. In this paper, we extend topic models from still images to motion videos with the integration of a temporal model. We propose a novel spatial-temporal framework that uses topic models for appearance modeling, and the probabilistic data association (PDA) filter for motion modeling. The spatial and temporal models are tightly integrated so that motion ambiguities can be resolved by appearance, and appearance ambiguities can be resolved by motion. We show promising results that cannot be achieved by appearance or motion modeling alone.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"370 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130859782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Method for Object Tracking Based on Regions Instead of Contours
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383454
N. Gómez, R. Alquézar, F. Serratosa
This paper presents a new method for object tracking in video sequences that is especially suitable in very noisy environments. In such situations, segmented images from one frame to the next are usually so different that it is very hard or even impossible to match the corresponding regions or contours of both images. With the aim of tracking objects in these situations, our approach has two main characteristics. On the one hand, we assume that tracking approaches based on contours cannot be applied, and therefore our system uses object recognition results computed from regions (specifically, colour spots from segmented images). On the other hand, we do not attempt to match the spots of consecutive segmented images and, consequently, avoid methods that represent the objects by structures such as graphs or skeletons, since the structures obtained may be too different in consecutive frames. Instead, we represent the location of tracked objects through images of probabilities that are updated dynamically using both the recognition and tracking results of previous steps. From these probabilities and a simple prediction of the apparent motion of the object in the image, a binary decision can be made for each pixel and object.
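A minimal sketch of the probability-image update described above, assuming a per-pixel recognition likelihood and a translational motion prediction are available for each frame; the blending weight and threshold are illustrative assumptions.

```python
import numpy as np

def update_probability_image(prev_prob, recognition_map, motion=(0, 0),
                             blend=0.6, threshold=0.5):
    """Propagate the per-pixel object probability image one frame forward.
    prev_prob       : (H, W) probabilities from the previous frame
    recognition_map : (H, W) per-pixel object likelihood from the current frame
    motion          : predicted apparent motion (dy, dx) in pixels
    Returns the updated probability image and its binary object mask."""
    # Predict: shift last frame's probabilities by the predicted motion.
    predicted = np.roll(prev_prob, shift=motion, axis=(0, 1))
    # Update: blend the prediction with the current recognition evidence.
    prob = np.clip(blend * predicted + (1.0 - blend) * recognition_map, 0.0, 1.0)
    return prob, prob > threshold
```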
{"title":"A New Method for Object Tracking Based on Regions Instead of Contours","authors":"N. Gómez, R. Alquézar, F. Serratosa","doi":"10.1109/CVPR.2007.383454","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383454","url":null,"abstract":"This paper presents a new method for object tracking in video sequences that is especially suitable in very noisy environments. In such situations, segmented images from one frame to the next one are usually so different that it is very hard or even impossible to match the corresponding regions or contours of both images. With the aim of tracking objects in these situations, our approach has two main characteristics. On one hand, we assume that the tracking approaches based on contours cannot be applied, and therefore, our system uses object recognition results computed from regions (specifically, colour spots from segmented images). On the other hand, we discard to match the spots of consecutive segmented images and, consequently, the methods that represent the objects by structures such as graphs or skeletons, since the structures obtained may be too different in consecutive frames. Thus, we represent the location of tracked objects through images of probabilities that are updated dynamically using both recognition and tracking results in previous steps. From these probabilities and a simple prediction of the apparent motion of the object in the image, a binary decision can be made for each pixel and abject.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133489389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning and Matching Line Aspects for Articulated Objects
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383021
Xiaofeng Ren
Traditional aspect graphs are topology-based and are impractical for articulated objects. In this work we learn a small number of aspects, or prototypical views, from video data. Ground-truth segmentations in video sequences are utilized for both training and testing aspect models that operate on static images. We represent aspects of an articulated object as collections of line segments. In learning aspects, where object centers are known, a linear matching based on line location and orientation is used to measure similarity between views. We use K-medoids to find cluster centers. When using line aspects in recognition, matching is based on pairwise cues of relative location and relative orientation, as well as adjacency and parallelism. Matching with pairwise cues leads to a quadratic optimization that we solve with a spectral approximation. We show that our line aspect matching is capable of locating people in a variety of poses. Line aspect matching performs significantly better than an alternative approach using the Hausdorff distance, showing the merits of the line representation.
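The matching step can be illustrated with standard spectral matching for quadratic assignment (a generic sketch assuming a precomputed non-negative affinity matrix; the paper's specific pairwise cues are not reproduced here).

```python
import numpy as np

def spectral_matching(assignments, affinity):
    """Approximate max x^T M x over binary indicator vectors x, where each entry of x
    is a candidate (model line, image line) assignment and M holds pairwise
    compatibility scores, using the principal eigenvector of M plus greedy discretization.
    assignments : list of (model_idx, image_idx) candidate pairs
    affinity    : (n, n) symmetric non-negative compatibility matrix M
    Returns indices into `assignments` of the accepted one-to-one matches."""
    n = len(assignments)
    # Principal eigenvector of M via power iteration (M is non-negative by construction).
    x = np.ones(n) / np.sqrt(n)
    for _ in range(200):
        x = affinity @ x
        x /= np.linalg.norm(x) + 1e-12
    # Greedy discretization: accept the strongest assignment, then discard all
    # candidates that reuse its model line or its image line, and repeat.
    order = np.argsort(-x)
    used_model, used_image, accepted = set(), set(), []
    for i in order:
        m, s = assignments[i]
        if x[i] <= 1e-9 or m in used_model or s in used_image:
            continue
        accepted.append(i)
        used_model.add(m)
        used_image.add(s)
    return accepted
```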
{"title":"Learning and Matching Line Aspects for Articulated Objects","authors":"Xiaofeng Ren","doi":"10.1109/CVPR.2007.383021","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383021","url":null,"abstract":"Traditional aspect graphs are topology-based and are impractical for articulated objects. In this work we learn a small number of aspects, or prototypical views, from video data. Groundtruth segmentations in video sequences are utilized for both training and testing aspect models that operate on static images. We represent aspects of an articulated object as collections of line segments. In learning aspects, where object centers are known, a linear matching based on line location and orientation is used to measure similarity between views. We use K-medoid to find cluster centers. When using line aspects in recognition, matching is based on pairwise cues of relative location, relative orientation as well adjacency and parallelism. Matching with pairwise cues leads to a quadratic optimization that we solve with a spectral approximation. We show that our line aspect matching is capable of locating people in a variety of poses. Line aspect matching performs significantly better than an alternative approach using Hausdorff distance, showing merits of the line representation.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133419152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photometric Self-Calibration of a Projector-Camera System
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383468
Ray Juang, A. Majumder
In this paper, we present a method for photometric self-calibration of a projector-camera system. In addition to the input transfer functions (commonly called gamma functions), we also reconstruct the spatial intensity fall-off from the center to the fringe (commonly called the vignetting effect) for both the projector and the camera. Projector-camera systems are becoming more popular in a large number of applications, such as scene capture, 3D reconstruction, and calibrating multi-projector displays. Our method enables the use of photometrically uncalibrated projectors and cameras in all such applications.
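A hedged sketch of the two quantities being recovered, fitting a power-law transfer function and an even-polynomial radial fall-off by least squares; these parametric forms are assumptions for illustration, not necessarily the models used in the paper.

```python
import numpy as np

def fit_gamma(inputs, outputs):
    """Fit a power-law transfer function out = in**gamma by least squares in log space.
    inputs, outputs : arrays of normalized intensities in (0, 1]."""
    mask = (inputs > 0) & (outputs > 0)
    log_in, log_out = np.log(inputs[mask]), np.log(outputs[mask])
    return np.sum(log_out * log_in) / np.sum(log_in ** 2)

def fit_vignetting(radii, gains, degree=4):
    """Fit the radial intensity fall-off as an even polynomial
    g(r) = 1 + a2*r^2 + a4*r^4 + ... with g(0) = 1 (full intensity at the center).
    radii : distances from the optical center (normalized); gains : measured fall-off."""
    powers = np.arange(2, degree + 1, 2)
    A = np.stack([radii ** p for p in powers], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, gains - 1.0, rcond=None)
    return dict(zip((f"a{p}" for p in powers), coeffs))
```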
{"title":"Photometric Self-Calibration of a Projector-Camera System","authors":"Ray Juang, A. Majumder","doi":"10.1109/CVPR.2007.383468","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383468","url":null,"abstract":"In this paper, we present a method for photometric self- calibration of a projector-camera system. In addition to the input transfer functions (commonly called gamma functions), we also reconstruct the spatial intensity fall-off from the center to fringe (commonly called the vignetting effect) for both the projector and camera. Projector-camera systems are becoming more popular in a large number of applications like scene capture, 3D reconstruction, and calibrating multi-projector displays. Our method enables the use of photometrically uncalibrated projectors and cameras in all such applications.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133260287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inferring Grammar-based Structure Models from 3D Microscopy Data
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383031
J. Schlecht, Kobus Barnard, Ekaterina H. Spriggs, B. Pryor
We present a new method to fit grammar-based stochastic models for biological structure to stacks of microscopic images captured at incremental focal lengths. Providing the ability to quantitatively represent structure and automatically fit it to image data enables important biological research. We consider the case where individuals can be represented as an instance of a stochastic grammar, similar to L-systems used in graphics to produce realistic plant models. In particular, we construct a stochastic grammar of Alternaria, a genus of fungus, and fit instances of it to microscopic image stacks. We express the image data as the result of a generative process composed of the underlying probabilistic structure model together with the parameters of the imaging system. Fitting the model then becomes probabilistic inference. For this we create a reversible-jump MCMC sampler to traverse the parameter space. We observe that incorporating spatial structure helps fit the model parts, and that simultaneously fitting the imaging system is also very helpful.
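To illustrate what "an instance of a stochastic grammar" means here, the following toy sampler draws branching structures from probabilistic rules; the symbols and probabilities are invented for illustration, and the reversible-jump MCMC fitting step is not shown.

```python
import random

# A toy stochastic grammar for a branching structure: each rule maps a symbol
# to a list of (probability, expansion) alternatives. Symbols and probabilities
# are illustrative only, not the grammar used in the paper.
RULES = {
    "spore":  [(1.0, ["stalk", "tip"])],
    "tip":    [(0.5, ["stalk", "tip"]),            # keep growing
               (0.3, ["branch", "stalk", "tip"]),  # sprout a side branch
               (0.2, [])],                         # terminate
    "branch": [(1.0, ["stalk", "tip"])],
}

def sample_structure(symbol="spore", depth=0, max_depth=8, rng=None):
    """Sample one individual from the stochastic grammar as a nested tree
    (symbol, children). Recursion depth is capped so sampling always terminates."""
    rng = rng or random.Random(0)
    if symbol not in RULES or depth >= max_depth:
        return (symbol, [])
    r, acc = rng.random(), 0.0
    for prob, expansion in RULES[symbol]:
        acc += prob
        if r <= acc:
            children = [sample_structure(s, depth + 1, max_depth, rng) for s in expansion]
            return (symbol, children)
    return (symbol, [])
```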
{"title":"Inferring Grammar-based Structure Models from 3D Microscopy Data","authors":"J. Schlecht, Kobus Barnard, Ekaterina H. Spriggs, B. Pryor","doi":"10.1109/CVPR.2007.383031","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383031","url":null,"abstract":"We present a new method to fit grammar-based stochastic models for biological structure to stacks of microscopic images captured at incremental focal lengths. Providing the ability to quantitatively represent structure and automatically fit it to image data enables important biological research. We consider the case where individuals can be represented as an instance of a stochastic grammar, similar to L-systems used in graphics to produce realistic plant models. In particular, we construct a stochastic grammar of Alternaria, a genus of fungus, and fit instances of it to microscopic image stacks. We express the image data as the result of a generative process composed of the underlying probabilistic structure model together with the parameters of the imaging system. Fitting the model then becomes probabilistic inference. For this we create a reversible-jump MCMC sampler to traverse the parameter space. We observe that incorporating spatial structure helps fit the model parts, and that simultaneously fitting the imaging system is also very helpful.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127795696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Efficacy of Correcting for Refractive Effects in Iris Recognition
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383380
J. R. Price, T. Gee, V. Paquit, K. Tobin
In this study, we aim to determine if iris recognition accuracy might be improved by correcting for the refractive effects of the human eye when the optical axes of the eye and camera are misaligned. We undertake this investigation using an anatomically approximated, three-dimensional model of the human eye and ray tracing. We generate synthetic iris imagery from different viewing angles, using first a simple pattern of concentric rings on the iris for analysis, and then synthetic texture maps on the iris for experimentation. We estimate the distortion from the concentric-ring iris images and use the results to guide the sampling of textured iris images that are distorted by refraction. Using the well-known Gabor filter phase quantization approach, our model-based results indicate that the Hamming distances between iris signatures from different viewing angles can be significantly reduced by accounting for refraction. Over our experimental conditions, comprising viewing angles from 0 to 60 degrees, we observe a median reduction in Hamming distance of 27.4% and a maximum reduction of 70.0% when we compensate for refraction. Maximum improvements are observed at viewing angles of 20 to 25 degrees.
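A small sketch of the signature-and-comparison step mentioned above (Gabor phase quantization followed by a normalized Hamming distance), assuming an unrolled angle-by-radius iris image; the filter parameters are illustrative.

```python
import numpy as np

def iris_code(unrolled_iris, wavelength=16.0, sigma=6.0):
    """Quantize the phase of complex Gabor responses along each row of an unrolled
    (angle x radius) iris image into a 2-bit-per-sample binary code."""
    cols = unrolled_iris.shape[1]
    x = np.arange(cols) - cols // 2
    carrier = np.exp(2j * np.pi * x / wavelength) * np.exp(-x**2 / (2 * sigma**2))
    rows = []
    for row in unrolled_iris:
        resp = np.convolve(row - row.mean(), carrier, mode="same")
        rows.append(np.stack([resp.real > 0, resp.imag > 0], axis=-1))
    return np.array(rows, dtype=bool)

def hamming_distance(code_a, code_b):
    """Fraction of disagreeing bits between two iris codes of equal shape."""
    return np.mean(code_a != code_b)
```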
{"title":"On the Efficacy of Correcting for Refractive Effects in Iris Recognition","authors":"J. R. Price, T. Gee, V. Paquit, K. Tobin","doi":"10.1109/CVPR.2007.383380","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383380","url":null,"abstract":"In this study, we aim to determine if iris recognition accuracy might be improved by correcting for the refractive effects of the human eye when the optical axes of the eye and camera are misaligned. We undertake this investigation using an anatomically-approximated, three-dimensional model of the human eye and ray-tracing. We generate synthetic iris imagery from different viewing angles using first a simple pattern of concentric rings on the iris for analysis, and then synthetic texture maps on the iris for experimentation. We estimate the distortion from the concentric-ring iris images and use the results to guide the sampling of textured iris images that are distorted by refraction. Using the well-known Gabor filter phase quantization approach, our model-based results indicate that the Hamming distances between iris signatures from different viewing angles can be significantly reduced by accounting for refraction. Over our experimental conditions comprising viewing angles from 0 to 60 degrees, we observe a median reduction in Hamming distance of 27.4% and a maximum reduction of 70.0% when we compensate for refraction. Maximum improvements are observed at viewing angles o/20deg-25deg.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131856873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-Time License Plate Recognition on an Embedded DSP-Platform
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383412
Clemens Arth, Florian Limberger, H. Bischof
In this paper we present a full-featured license plate detection and recognition system. The system is implemented on an embedded DSP platform and processes a video stream in real time. It consists of a detection module and a character recognition module. The detector is based on the AdaBoost approach presented by Viola and Jones. Detected license plates are segmented into individual characters using a region-based approach. Character classification is performed with support vector classification. In order to speed up the detection process on the embedded device, a Kalman tracker is integrated into the system. The search area of the detector is limited to locations where the next position of a license plate is predicted. Furthermore, classification results of subsequent frames are combined to improve the classification accuracy. The major advantages of our system are its real-time capability and that it does not require any additional sensor input (e.g. from infrared sensors) beyond a video stream. We evaluate our system on a large number of vehicles and license plates using bad-quality video and show that the low resolution can be partly compensated for by combining the classification results of subsequent frames.
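The frame-combination idea can be sketched as score fusion across tracked detections of the same plate; the score format and character alphabet below are assumptions for illustration.

```python
import numpy as np

def fuse_plate_reads(per_frame_scores, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Combine per-character classification results from several tracked frames.
    per_frame_scores : list of (n_chars, n_classes) arrays, one per frame, holding
    classifier scores (e.g. SVM decision values or probabilities) for the same
    tracked plate; n_classes must equal len(alphabet).
    Scores are summed over frames and the best class is taken per character."""
    total = np.sum(per_frame_scores, axis=0)            # (n_chars, n_classes)
    best = total.argmax(axis=1)
    confidence = total.max(axis=1) / len(per_frame_scores)
    return "".join(alphabet[i] for i in best), confidence
```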
{"title":"Real-Time License Plate Recognition on an Embedded DSP-Platform","authors":"Clemens Arth, Florian Limberger, H. Bischof","doi":"10.1109/CVPR.2007.383412","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383412","url":null,"abstract":"In this paper we present a full-featured license plate detection and recognition system. The system is implemented on an embedded DSP platform and processes a video stream in real-time. It consists of a detection and a character recognition module. The detector is based on the AdaBoost approach presented by Viola and Jones. Detected license plates are segmented into individual characters by using a region-based approach. Character classification is performed with support vector classification. In order to speed up the detection process on the embedded device, a Kalman tracker is integrated into the system. The search area of the detector is limited to locations where the next location of a license plate is predicted. Furthermore, classification results of subsequent frames are combined to improve the class accuracy. The major advantages of our system are its real-time capability and that it does not require any additional sensor input (e.g. from infrared sensors) except a video stream. We evaluate our system on a large number of vehicles and license plates using bad quality video and show that the low resolution can be partly compensated by combining classification results of subsequent frames.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133746885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object retrieval with large vocabularies and fast spatial matching
Pub Date: 2007-06-17 | DOI: 10.1109/CVPR.2007.383172
James Philbin, Ondřej Chum, M. Isard, Josef Sivic, Andrew Zisserman
In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries. Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large. We view this work as a promising step towards much larger, "web-scale" image corpora.
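For context, a minimal tf-idf inverted index over quantized visual words, which is the structure the vocabulary feeds; this is a generic sketch rather than the paper's system, and the spatial verification re-ranking is only indicated by a comment.

```python
import numpy as np
from collections import defaultdict

class BagOfWordsIndex:
    """Minimal tf-idf inverted index over quantized image features (visual words)."""

    def __init__(self, n_words):
        self.n_words = n_words
        self.inverted = defaultdict(list)     # word -> [(image_id, tf weight)]
        self.doc_freq = np.zeros(n_words)
        self.n_images = 0

    def add_image(self, image_id, words):
        """words: integer array of visual-word ids found in the image."""
        hist = np.bincount(words, minlength=self.n_words).astype(float)
        hist /= hist.sum() + 1e-12            # term frequency
        for w in np.nonzero(hist)[0]:
            self.inverted[w].append((image_id, hist[w]))
            self.doc_freq[w] += 1
        self.n_images += 1

    def query(self, words, top_k=10):
        """Score indexed images by the dot product of tf-idf vectors."""
        idf = np.log(self.n_images / (self.doc_freq + 1e-12))
        hist = np.bincount(words, minlength=self.n_words).astype(float)
        hist /= hist.sum() + 1e-12
        scores = defaultdict(float)
        for w in np.nonzero(hist)[0]:
            for image_id, tf in self.inverted[w]:
                scores[image_id] += (hist[w] * idf[w]) * (tf * idf[w])
        # The top-ranked images would then be re-ranked by spatial verification.
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```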
{"title":"Object retrieval with large vocabularies and fast spatial matching","authors":"James Philbin, Ondřej Chum, M. Isard, Josef Sivic, Andrew Zisserman","doi":"10.1109/CVPR.2007.383172","DOIUrl":"https://doi.org/10.1109/CVPR.2007.383172","url":null,"abstract":"In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries. Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large. We view this work as a promising step towards much larger, \"web-scale \" image corpora.","PeriodicalId":351008,"journal":{"name":"2007 IEEE Conference on Computer Vision and Pattern Recognition","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115196452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}