A fully implicit alternating direction method of multipliers for the minimization of convex problems with an application to motion segmentation
Karin Tichmann, O. Junge
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836018 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 823-830
Motivated by a variational formulation of the motion segmentation problem, we propose a fully implicit variant of the (linearized) alternating direction method of multipliers for the minimization of convex functionals over a convex set. The new scheme does not require a step-size restriction for stability and thus approaches the minimum in considerably fewer iterations. In numerical experiments on standard image sequences, the scheme often significantly outperforms other state-of-the-art methods.
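The abstract does not spell out the iteration itself. For orientation only, a standard linearized ADMM for a problem of the form min_x f(x) + g(Ax) (penalty rho, proximal step size tau, scaled dual variable u) is sketched below; it is the linearized x-update that carries the usual step-size restriction, and a restriction of this form is presumably what the fully implicit variant dispenses with.

```latex
% Background sketch (not taken from the paper): linearized ADMM for
% $\min_x f(x) + g(Ax)$ with penalty $\rho$, proximal step size $\tau$,
% and scaled dual variable $u$.
\begin{align*}
x^{k+1} &= \operatorname*{arg\,min}_x \; f(x)
  + \rho \,\big\langle A^{\top}\!\big(Ax^{k} - z^{k} + u^{k}\big),\, x \big\rangle
  + \tfrac{1}{2\tau}\,\lVert x - x^{k} \rVert^{2}, \\
z^{k+1} &= \operatorname*{arg\,min}_z \; g(z)
  + \tfrac{\rho}{2}\,\lVert Ax^{k+1} - z + u^{k} \rVert^{2}, \\
u^{k+1} &= u^{k} + Ax^{k+1} - z^{k+1},
\qquad \text{with stability typically requiring } \tau\rho\,\lVert A\rVert^{2} \le 1 .
\end{align*}
```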
{"title":"A fully implicit alternating direction method of multipliers for the minimization of convex problems with an application to motion segmentation","authors":"Karin Tichmann, O. Junge","doi":"10.1109/WACV.2014.6836018","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836018","url":null,"abstract":"Motivated by a variational formulation of the motion segmentation problem, we propose a fully implicit variant of the (linearized) alternating direction method of multipliers for the minimization of convex functionals over a convex set. The new scheme does not require a step size restriction for stability and thus approaches the minimum using considerably fewer iterates. In numerical experiments on standard image sequences, the scheme often significantly outperforms other state of the art methods.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"57 1","pages":"823-830"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84567755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interactive video segmentation using occlusion boundaries and temporally coherent superpixels
Radu Dondera, Vlad I. Morariu, Yulu Wang, L. Davis
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836023 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 784-791
We propose an interactive video segmentation system built on occlusion and long-term spatio-temporal structure cues. User supervision is incorporated in a superpixel graph clustering framework that differs crucially from prior art in that it modifies the graph according to the output of an occlusion boundary detector. Working with long temporal intervals (up to 100 frames) enables our system to significantly reduce annotation effort with respect to state-of-the-art systems. Even though the segmentation results are less than perfect, they are obtained efficiently and can be used for weakly supervised learning from video or for video content description. We do not rely on a discriminative object appearance model, and we allow multiple foreground objects to be extracted together, saving user time when more than one object is present. Additional experiments with unsupervised clustering based on occlusion boundaries demonstrate the importance of this cue for video segmentation and thus validate our system design.
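As an illustration of the clustering side of such a system (not the authors' implementation), the sketch below clusters superpixels with a spectral method whose affinities are attenuated across detected occlusion boundaries. The inputs superpixel_features, adjacency, and occlusion_strength are assumed to come from a superpixel extractor, spatio-temporal linking, and an occlusion boundary detector, respectively.

```python
# Illustrative sketch only: occlusion-aware spectral clustering of superpixels.
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_superpixels(superpixel_features, adjacency, occlusion_strength,
                        n_segments=2, sigma=0.5):
    """superpixel_features: (N, D) appearance/motion descriptors.
    adjacency: (N, N) boolean spatio-temporal adjacency of superpixels.
    occlusion_strength: (N, N) values in [0, 1], occlusion boundary detector
    response on the shared boundary of each superpixel pair."""
    n = len(superpixel_features)
    diff = superpixel_features[:, None, :] - superpixel_features[None, :, :]
    affinity = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))
    # Occlusion boundaries weaken links, so clusters prefer not to straddle them.
    affinity *= (1.0 - occlusion_strength)
    affinity *= adjacency                         # keep only spatio-temporal neighbors
    affinity = np.maximum(affinity, affinity.T)   # symmetrize
    affinity[np.diag_indices(n)] = 1.0            # keep the graph well-posed
    labels = SpectralClustering(n_clusters=n_segments,
                                affinity="precomputed").fit_predict(affinity)
    return labels
```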
{"title":"Interactive video segmentation using occlusion boundaries and temporally coherent superpixels","authors":"Radu Dondera, Vlad I. Morariu, Yulu Wang, L. Davis","doi":"10.1109/WACV.2014.6836023","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836023","url":null,"abstract":"We propose an interactive video segmentation system built on the basis of occlusion and long term spatio-temporal structure cues. User supervision is incorporated in a superpixel graph clustering framework that differs crucially from prior art in that it modifies the graph according to the output of an occlusion boundary detector. Working with long temporal intervals (up to 100 frames) enables our system to significantly reduce annotation effort with respect to state of the art systems. Even though the segmentation results are less than perfect, they are obtained efficiently and can be used in weakly supervised learning from video or for video content description. We do not rely on a discriminative object appearance model and allow extracting multiple foreground objects together, saving user time if more than one object is present. Additional experiments with unsupervised clustering based on occlusion boundaries demonstrate the importance of this cue for video segmentation and thus validate our system design.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"180 1","pages":"784-791"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88468919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time video decolorization using bilateral filtering
Yibing Song, Linchao Bao, Qingxiong Yang
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836106 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 159-166
This paper presents a real-time decolorization method. Given the human visual system's preference for luminance information, the luminance should be preserved as much as possible during decolorization. The proposed decolorization method therefore measures the amount of color contrast/detail lost when converting color to luminance. The detail loss is estimated by computing the difference between two intermediate images: one obtained by applying a bilateral filter to the original color image, and the other obtained by applying a joint bilateral filter to the original color image with its luminance as the guidance image. The estimated detail loss is then mapped to a grayscale image, named the residual image, by minimizing the difference between the image gradients of the input color image and those of the objective grayscale image, which is the sum of the residual image and the luminance. By construction, the residual image is identically zero (that is, the two intermediate images are the same) only when no visual detail is missing from the luminance. Unlike most previous methods, the proposed decolorization method preserves both the contrast of the color image and its luminance. Quantitative evaluation shows that it is the top performer on the standard test suite. It is also very robust and can be used directly to convert videos while maintaining temporal coherence. Specifically, it can convert a high-resolution video (1280 × 720) in real time (about 28 Hz) on a 3.4 GHz i7 CPU.
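The detail-loss estimate is concrete enough to sketch. The following rough illustration (not the authors' code) uses OpenCV; cv2.ximgproc.jointBilateralFilter requires the opencv-contrib package, and the subsequent gradient-domain mapping to the residual image, which the abstract only outlines, is omitted.

```python
# Rough illustration of the detail-loss estimate described above.
import cv2
import numpy as np

def estimate_detail_loss(bgr, d=9, sigma_color=25.0, sigma_space=9.0):
    """Difference between (a) bilateral filtering the color image and
    (b) joint bilateral filtering it with its own luminance as guidance.
    Where the two disagree, detail is carried by color rather than luminance."""
    luma = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    bf = cv2.bilateralFilter(bgr, d, sigma_color, sigma_space)
    jbf = cv2.ximgproc.jointBilateralFilter(luma, bgr, d, sigma_color, sigma_space)
    loss = np.abs(bf.astype(np.float32) - jbf.astype(np.float32))
    # Per-pixel detail loss; the paper then fits a residual image to this
    # via a gradient-domain minimization (not reproduced here).
    return loss.max(axis=2)
```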
{"title":"Real-time video decolorization using bilateral filtering","authors":"Yibing Song, Linchao Bao, Qingxiong Yang","doi":"10.1109/WACV.2014.6836106","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836106","url":null,"abstract":"This paper presents a real-time decolorization method. Given the human visual systems preference for luminance information, the luminance should be preserved as much as possible during decolorization. As a result, the proposed decolorization method measures the amount of color contrast/detail lost when converting color to luminance. The detail loss is estimated by computing the difference between two intermediate images: one obtained by applying bilateral filter to the original color image, and the other obtained by applying joint bilateral filter to the original color image with its luminance as the guidance image. The estimated detail loss is then mapped to a grayscale image named residual image by minimizing the difference between the image gradients of the input color image and the objective grayscale image that is the sum of the residual image and the luminance. Apparently, the residual image will contain pixels with all zero values (that is the two intermediate images will be the same) only when no visual detail is missing in the luminance. Unlike most previous methods, the proposed decolorization method preserves both contrast in the color image and the luminance. Quantitative evaluation shows that it is the top performer on the standard test suite. Meanwhile it is very robust and can be directly used to convert videos while maintaining the temporal coherence. Specifically it can convert a high-resolution video (1280 × 720) in real time (about 28 Hz) on a 3.4 GHz i7 CPU.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"55 1","pages":"159-166"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90052446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint hierarchical learning for efficient multi-class object detection
Hamidreza Odabai Fard, M. Chaouch, Q. Pham, A. Vacavant, T. Chateau
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836090 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 261-268
Beyond multi-class classification, the multi-class object detection task must additionally handle a dominating background label. In this work, we present a novel approach in which relevant classes are ranked higher and background labels are rejected. To this end, we arrange the classes into a tree structure whose classifiers are trained in a joint framework combining ranking and classification constraints. Our convex problem formulation naturally allows applying a tree traversal algorithm that searches for the best class label while progressively rejecting background labels. We evaluate our approach on the PASCAL VOC 2007 dataset and show a considerable speed-up of the detection time together with increased detection performance.
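A hedged sketch of how such a traversal could look follows; the node scorers, tree layout, and rejection threshold are illustrative assumptions, not the paper's trained models.

```python
# Greedy traversal of a class hierarchy: descend towards the best-scoring child
# and reject a window as background if every child score falls below a threshold.
import numpy as np

class Node:
    def __init__(self, label, weights=None, children=()):
        self.label = label          # class label, or an internal grouping
        self.weights = weights      # linear scorer for this node (assumed given)
        self.children = list(children)

def traverse(root, feature, reject_threshold=0.0):
    node = root
    while node.children:
        scores = [np.dot(c.weights, feature) for c in node.children]
        best = int(np.argmax(scores))
        if scores[best] < reject_threshold:
            return None             # rejected as background early
        node = node.children[best]
    return node.label               # leaf reached: predicted object class
```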
{"title":"Joint hierarchical learning for efficient multi-class object detection","authors":"Hamidreza Odabai Fard, M. Chaouch, Q. Pham, A. Vacavant, T. Chateau","doi":"10.1109/WACV.2014.6836090","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836090","url":null,"abstract":"In addition to multi-class classification, the multi-class object detection task consists further in classifying a dominating background label. In this work, we present a novel approach where relevant classes are ranked higher and background labels are rejected. To this end, we arrange the classes into a tree structure where the classifiers are trained in a joint framework combining ranking and classification constraints. Our convex problem formulation naturally allows to apply a tree traversal algorithm that searches for the best class label and progressively rejects background labels. We evaluate our approach on the PASCAL VOC 2007 dataset and show a considerable speed-up of the detection time with increased detection performance.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"58 1","pages":"261-268"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90557973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mining discriminative 3D Poselet for cross-view action recognition
Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836043 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 634-639
This paper presents a novel approach to cross-view action recognition. Traditional cross-view action recognition methods typically rely on local appearance/motion features. In this paper, we take advantage of recent developments in depth cameras to build a more discriminative cross-view action representation. In this representation, an action is characterized by the spatio-temporal configuration of 3D Poselets, which are discriminatively discovered with a novel Poselet mining algorithm and can be detected with view-invariant 3D Poselet detectors. The Kinect skeleton is employed to facilitate 3D Poselet mining and the learning of 3D Poselet detectors, but recognition is based solely on 2D video input. Extensive experiments demonstrate that this new action representation significantly improves the accuracy and robustness of cross-view action recognition.
{"title":"Mining discriminative 3D Poselet for cross-view action recognition","authors":"Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu","doi":"10.1109/WACV.2014.6836043","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836043","url":null,"abstract":"This paper presents a novel approach to cross-view action recognition. Traditional cross-view action recognition methods typically rely on local appearance/motion features. In this paper, we take advantage of the recent developments of depth cameras to build a more discriminative cross-view action representation. In this representation, an action is characterized by the spatio-temporal configuration of 3D Poselets, which are discriminatively discovered with a novel Poselet mining algorithm and can be detected with view-invariant 3D Poselet detectors. The Kinect skeleton is employed to facilitate the 3D Poselet mining and 3D Poselet detectors learning, but the recognition is solely based on 2D video input. Extensive experiments have demonstrated that this new action representation significantly improves the accuracy and robustness for cross-view action recognition.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"69 1","pages":"634-639"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77063414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transfer learning via attributes for improved on-the-fly classification
Praveen Kulkarni, Gaurav Sharma, J. Zepeda, Louis Chevallier
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836097 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 220-226
Retrieving images for an arbitrary user query, provided in textual form, is a challenging problem. A recently proposed method addresses this by constructing a visual classifier on the fly, using images returned by an internet image search for the user query as positives and a fixed pool of negative images. However, in practice, not all images obtained from an internet image search are pertinent to the query; some contain abstract or artistic representations of the content and some contain artifacts. Such images degrade the performance of the on-the-fly constructed classifier. We propose a method for improving the performance of on-the-fly classifiers by using transfer learning via attributes. We first map the textual query to a set of known attributes and then use those attributes to prune the set of images downloaded from the internet. This pruning step can be seen as zero-shot learning of the visual classifier for the textual user query, transferring knowledge from the attribute domain to the query domain. We also use the attributes along with the on-the-fly classifier to score the database images and obtain a hybrid ranking. We show interesting qualitative results and demonstrate with experiments on standard datasets that the proposed method improves upon the baseline on-the-fly classification system.
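A minimal sketch of the pruning and hybrid-ranking idea, assuming pre-trained linear attribute classifiers and a linear on-the-fly classifier; the keep ratio, the weighting alpha, and the helper names are illustrative and not from the paper.

```python
# Illustrative sketch: prune noisy web images with attribute classifiers, then
# rank database images by a hybrid of on-the-fly and attribute scores.
import numpy as np

def prune_web_images(features, query_attributes, attribute_models, keep_ratio=0.7):
    """features: (N, D) descriptors of downloaded images.
    query_attributes: indices of attributes associated with the textual query.
    attribute_models: (A, D) linear attribute classifiers (assumed pre-trained)."""
    scores = features @ attribute_models[query_attributes].T   # (N, |query attrs|)
    agreement = scores.mean(axis=1)
    keep = np.argsort(-agreement)[: int(keep_ratio * len(features))]
    return keep                      # indices of images retained as positives

def hybrid_rank(db_features, onfly_w, query_attributes, attribute_models, alpha=0.5):
    """Blend the on-the-fly classifier score with the attribute agreement score."""
    onfly_score = db_features @ onfly_w
    attr_score = (db_features @ attribute_models[query_attributes].T).mean(axis=1)
    return np.argsort(-(alpha * onfly_score + (1 - alpha) * attr_score))
```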
{"title":"Transfer learning via attributes for improved on-the-fly classification","authors":"Praveen Kulkarni, Gaurav Sharma, J. Zepeda, Louis Chevallier","doi":"10.1109/WACV.2014.6836097","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836097","url":null,"abstract":"Retrieving images for an arbitrary user query, provided in textual form, is a challenging problem. A recently proposed method addresses this by constructing a visual classifier with images returned by an internet image search engine, based on the user query, as positive images while using a fixed pool of negative images. However, in practice, not all the images obtained from internet image search are always pertinent to the query; some might contain abstract or artistic representation of the content and some might have artifacts. Such images degrade the performance of on-the-fly constructed classifier. We propose a method for improving the performance of on-the-fly classifiers by using transfer learning via attributes. We first map the textual query to a set of known attributes and then use those attributes to prune the set of images downloaded from the internet. This pruning step can be seen as zero-shot learning of the visual classifier for the textual user query, which transfers knowledge from the attribute domain to the query domain. We also use the attributes along with the on-the-fly classifier to score the database images and obtain a hybrid ranking. We show interesting qualitative results and demonstrate by experiments with standard datasets that the proposed method improves upon the baseline on-the-fly classification system.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"168 1","pages":"220-226"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86887252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optical filter selection for automatic visual inspection
Matthias Richter, J. Beyerer
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836110 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 123-128
The color of a material is one of the most frequently used features in automated visual inspection systems. While this is sufficient for many “easy” tasks, mixed and organic materials usually require more complex features. Spectral signatures, especially in the near-infrared range, have proven useful in many cases. However, hyperspectral imaging devices are still very costly and too slow to use in practice. As a workaround, off-the-shelf cameras and optical filters are used to extract a few characteristic features from the spectra. Often, these filters are selected by a human expert in a time-consuming and error-prone process; surprisingly few works are concerned with the automatic selection of suitable filters. We approach this problem by casting filter selection as a feature selection problem. In contrast to existing techniques, which are mainly concerned with filter design, our approach explicitly selects the best out of a large set of given filters. Our method is most appealing in an industrial setting, where this set represents the (physically) available filters. We demonstrate the application of our technique by implementing six different selection strategies and applying each to two real-world sorting problems.
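One simple selection strategy, greedy forward selection driven by cross-validated classification accuracy, can be sketched as follows. The abstract does not list the six strategies, so this is an assumed example rather than one of them; each "feature" is the scalar response of one candidate optical filter applied to the measured spectra.

```python
# Minimal sketch: greedily pick the filters whose responses best separate the classes.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def select_filters(filter_responses, labels, n_select=3):
    """filter_responses: (n_samples, n_filters) response of each candidate filter.
    labels: (n_samples,) material classes. Returns indices of chosen filters."""
    selected, remaining = [], list(range(filter_responses.shape[1]))
    for _ in range(n_select):
        scores = []
        for f in remaining:
            cols = filter_responses[:, selected + [f]]
            acc = cross_val_score(LinearSVC(dual=False), cols, labels, cv=5).mean()
            scores.append(acc)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```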
{"title":"Optical filter selection for automatic visual inspection","authors":"Matthias Richter, J. Beyerer","doi":"10.1109/WACV.2014.6836110","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836110","url":null,"abstract":"The color of a material is one of the most frequently used features in automated visual inspection systems. While this is sufficient for many “easy” tasks, mixed and organic materials usually require more complex features. Spectral signatures, especially in the near infrared range, have been proven useful in many cases. However, hyperspectral imaging devices are still very costly and too slow to use them in practice. As a work-around, off-the-shelve cameras and optical filters are used to extract few characteristic features from the spectra. Often, these filters are selected by a human expert in a time consuming and error prone process; surprisingly few works are concerned with automatic selection of suitable filters. We approach this problem by stating filter selection as feature selection problem. In contrast to existing techniques that are mainly concerned with filter design, our approach explicitly selects the best out of a large set of given filters. Our method becomes most appealing for use in an industrial setting, when this selection represents (physically) available filters. We show the application of our technique by implementing six different selection strategies and applying each to two real-world sorting problems.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"2021 1","pages":"123-128"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87954008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the 3D layout of a cluttered room from multiple images
Sid Ying-Ze Bao, A. Furlan, Li Fei-Fei, S. Savarese
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836035 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 690-697
We present a novel framework for robustly understanding the geometric and semantic structure of a cluttered room from a small number of images captured from different viewpoints. The tasks we address are: i) estimating the 3D layout of the room, that is, the 3D configuration of floor, walls and ceiling; and ii) identifying and localizing all the foreground objects in the room. We jointly use multi-view geometry constraints and image appearance to identify the best room layout configuration. Extensive experimental evaluation demonstrates that our results are more complete and accurate, both in estimating the 3D room structure and in recognizing objects, than those of alternative state-of-the-art algorithms. In addition, we present an augmented-reality mobile application that highlights the high accuracy of our method and may be beneficial to many computer vision applications.
{"title":"Understanding the 3D layout of a cluttered room from multiple images","authors":"Sid Ying-Ze Bao, A. Furlan, Li Fei-Fei, S. Savarese","doi":"10.1109/WACV.2014.6836035","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836035","url":null,"abstract":"We present a novel framework for robustly understanding the geometrical and semantic structure of a cluttered room from a small number of images captured from different viewpoints. The tasks we seek to address include: i) estimating the 3D layout of the room - that is, the 3D configuration of floor, walls and ceiling; ii) identifying and localizing all the foreground objects in the room. We jointly use multiview geometry constraints and image appearance to identify the best room layout configuration. Extensive experimental evaluation demonstrates that our estimation results are more complete and accurate in estimating 3D room structure and recognizing objects than alternative state-of-the-art algorithms. In addition, we show an augmented reality mobile application to highlight the high accuracy of our method, which may be beneficial to many computer vision applications.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"27 1","pages":"690-697"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89065362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust optical flow estimation for continuous blurred scenes using RGB-motion imaging and directional filtering
Wenbin Li, Yang Chen, JeeHang Lee, Gang Ren, D. Cosker
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836022 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 792-799
Optical flow estimation is a difficult task given real-world video footage with camera and object blur. In this paper, we combine a 3D pose-and-position tracker with an RGB sensor, allowing us to capture video footage together with the 3D camera motion. We show that the additional camera motion information can be embedded into a hybrid optical flow framework by interleaving an iterative blind deconvolution and a warping-based minimization scheme. Such a hybrid framework significantly improves the accuracy of optical flow estimation in scenes with strong blur. Our approach yields improved overall performance against three state-of-the-art baseline methods on our proposed ground-truth sequences, as well as on several other real-world sequences captured by our novel imaging system.
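A loose stand-in for the interleaving idea, not the authors' method: deblur both frames with a motion kernel derived from the tracked camera translation, re-estimate the flow on the deblurred frames with a warm start, and repeat. The kernel construction, the use of non-blind Richardson-Lucy in place of blind deconvolution, and all parameters are simplifying assumptions.

```python
# Sketch of interleaved deblurring and flow estimation under simplifying assumptions.
import cv2
import numpy as np
from skimage.restoration import richardson_lucy

def line_psf(dx, dy, size=15):
    """Simple linear motion-blur kernel along the tracked camera translation."""
    psf = np.zeros((size, size), np.float32)
    c = size // 2
    steps = max(abs(int(round(dx))), abs(int(round(dy))), 1)
    for t in np.linspace(-0.5, 0.5, steps + 1):
        y, x = int(round(c + t * dy)), int(round(c + t * dx))
        if 0 <= y < size and 0 <= x < size:
            psf[y, x] = 1.0
    return psf / psf.sum()

def flow_with_deblur(prev_gray, next_gray, cam_dx, cam_dy, n_outer=3):
    prev_f = prev_gray.astype(np.float64) / 255.0
    next_f = next_gray.astype(np.float64) / 255.0
    psf = line_psf(cam_dx, cam_dy)
    flow = None
    for _ in range(n_outer):
        # Deconvolution step (stand-in for blind deconvolution): Richardson-Lucy
        # with the camera-motion PSF.
        prev_d = richardson_lucy(prev_f, psf)
        next_d = richardson_lucy(next_f, psf)
        # Warping-based flow step, warm-started with the previous estimate.
        flow = cv2.calcOpticalFlowFarneback(
            (prev_d * 255).astype(np.uint8), (next_d * 255).astype(np.uint8),
            flow, 0.5, 3, 15, 3, 5, 1.2,
            cv2.OPTFLOW_USE_INITIAL_FLOW if flow is not None else 0)
    return flow
```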
{"title":"Robust optical flow estimation for continuous blurred scenes using RGB-motion imaging and directional filtering","authors":"Wenbin Li, Yang Chen, JeeHang Lee, Gang Ren, D. Cosker","doi":"10.1109/WACV.2014.6836022","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836022","url":null,"abstract":"Optical flow estimation is a difficult task given real-world video footage with camera and object blur. In this paper, we combine a 3D pose&position tracker with an RGB sensor allowing us to capture video footage together with 3D camera motion. We show that the additional camera motion information can be embedded into a hybrid optical flow framework by interleaving an iterative blind deconvolution and warping based minimization scheme. Such a hybrid framework significantly improves the accuracy of optical flow estimation in scenes with strong blur. Our approach yields improved overall performance against three state-of-the-art baseline methods applied to our proposed ground truth sequences, as well as in several other real-world sequences captured by our novel imaging system.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"108 1","pages":"792-799"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87611216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benchmarking large-scale Fine-Grained Categorization
A. Angelova, Philip M. Long
Pub Date: 2014-03-24 | DOI: 10.1109/WACV.2014.6836056 | IEEE Winter Conference on Applications of Computer Vision (WACV 2014), pp. 532-539
This paper presents a systematic evaluation of recent methods in the fine-grained categorization domain, which have shown significant promise. More specifically, we investigate an automatic segmentation algorithm, a region pooling algorithm akin to pose-normalized pooling [31] [28], and a multi-class optimization method. We consider the largest and most popular datasets available for fine-grained categorization: the Caltech-UCSD 200 Birds dataset [27], the Oxford 102 Flowers dataset [19], the Stanford 120 Dogs dataset [16], and the Oxford 37 Cats and Dogs dataset [21]. We view this work from a practitioner's perspective, answering the question: which methods can build the best possible fine-grained recognition system for use in practice? Our experiments provide insights into the relative merits of these methods. More importantly, after combining the methods, we achieve the top results in the field, outperforming the state-of-the-art methods by 4.8% and 10.3% on the birds and dogs datasets, respectively. Additionally, our method achieves a mAP of 37.92 on the 2012 ImageNet Fine-Grained Categorization Challenge [1], outperforming the winner of that challenge by 5.7 points.
{"title":"Benchmarking large-scale Fine-Grained Categorization","authors":"A. Angelova, Philip M. Long","doi":"10.1109/WACV.2014.6836056","DOIUrl":"https://doi.org/10.1109/WACV.2014.6836056","url":null,"abstract":"This paper presents a systematic evaluation of recent methods in the fine-grained categorization domain, which have shown significant promise. More specifically, we investigate an automatic segmentation algorithm, a region pooling algorithm which is akin to pose-normalized pooling [31] [28], and a multi-class optimization method. We considered the largest and most popular datasets for fine-grained categorization available in the field: the Caltech-UCSD 200 Birds dataset [27], the Oxford 102 Flowers dataset [19], the Stanford 120 Dogs dataset [16], and the Oxford 37 Cats and Dogs dataset [21]. We view this work from a practitioner's perspective, answering the question: what are the methods that can create the best possible fine-grained recognition system which can be applied in practice? Our experiments provide insights of the relative merit of these methods. More importantly, after combining the methods, we achieve the top results in the field, outperforming the state-of-the-art methods by 4.8% and 10.3% for birds and dogs datasets, respectively. Additionally, our method achieves a mAP of 37.92 on the of 2012 Imagenet Fine-Grained Categorization Challenge [1], which outperforms the winner of this challenge by 5.7 points.","PeriodicalId":73325,"journal":{"name":"IEEE Winter Conference on Applications of Computer Vision. IEEE Winter Conference on Applications of Computer Vision","volume":"83 1","pages":"532-539"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89952993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}